A slightly smaller 4-bit quant is needed.

#2 by amit864 - opened

This model has a 256k context window. I'm not sure whether it would run on a single H100 or RTX Pro 6000. Perhaps the quantization could be a bit more aggressive so the model fits on a single GPU (halving the inference cost). Currently the model weights alone take up almost 80 GB of memory.
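
For reference, here's a rough back-of-the-envelope sketch of why the current quant is such a tight fit. The parameter count below is a hypothetical placeholder (the exact model size isn't quoted above), chosen so that 4 bits per weight lands near the ~80 GB figure:

```python
# Rough memory estimate for the weights alone (ignores KV cache and activations).
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    return num_params * bits_per_weight / 8 / 1e9

params = 160e9  # hypothetical ~160B parameters, so that 4-bit comes out near 80 GB

for bits in (4.0, 3.5, 3.0):
    print(f"{bits} bits/weight -> ~{weight_memory_gb(params, bits):.0f} GB of weights")
```

Under that assumption, 4-bit weights alone nearly fill an 80 GB H100 (or a 96 GB RTX Pro 6000), leaving little headroom for the KV cache a 256k context needs, which is why a slightly smaller quant would make single-GPU inference practical.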
