A slightly smaller 4-bit quant is needed.
#2 by amit864 - opened
This model has a 256k context window. I'm not sure it would run on a single H100 or RTX Pro 6000: the current 4-bit weights alone already take up almost 80 GB of memory. A slightly more aggressive quantization might let it fit on a single GPU, which would roughly halve the inference cost.
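To make the request concrete, here is a minimal back-of-the-envelope sketch of the VRAM arithmetic: weight footprint at a few effective bits-per-weight, plus the KV cache cost of long contexts. The parameter count and GQA layout below are placeholders I picked so the ~80 GB figure roughly reproduces, not the model's actual config; swap in the real values from its config.json.

```python
# Rough VRAM arithmetic: weight footprint at different quant levels, plus the
# KV cache cost of the context window. All architecture numbers below are
# placeholders -- substitute the real values from this model's config.json.

GIB = 2**30

def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB at a given effective bits-per-weight."""
    return n_params * bits_per_weight / 8 / GIB

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache footprint in GiB (keys + values, fp16/bf16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / GIB

if __name__ == "__main__":
    params = 150e9                       # assumed parameter count (placeholder)
    layers, kv_heads, hdim = 80, 8, 128  # assumed GQA layout (placeholder)

    for bpw in (4.5, 4.0, 3.5):          # ~Q4_K_M down to a more aggressive ~3.5 bpw
        print(f"{bpw:.1f} bpw -> weights ~{weights_gib(params, bpw):.0f} GiB")

    for ctx in (32 * 1024, 256 * 1024):  # a practical window vs. the full 256k
        print(f"KV cache at {ctx // 1024}k context: "
              f"~{kv_cache_gib(layers, kv_heads, hdim, ctx):.0f} GiB")
```

With these assumed numbers, dropping from ~4.5 to ~3.5 effective bits saves on the order of 15-20 GiB of weights, but an fp16 KV cache at the full 256k window would still add tens of GiB on top, so a single-GPU setup would likely also mean running a shorter context or quantizing the KV cache.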