A slightly smaller 4-bit quant is needed.
#2 by amit864 - opened
This model has a 256k context window. I'm not sure it would run on a single H100 or RTX Pro 6000: the current 4-bit weights alone already take up almost 80 GB of memory. A slightly more aggressive quantization might let it fit on a single GPU, which would roughly halve the inference cost.
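To make the request concrete, here is a minimal back-of-the-envelope sketch of the VRAM arithmetic: weight footprint at a few effective bits-per-weight, plus the KV cache cost of long contexts. The parameter count and GQA layout below are placeholders I picked so the ~80 GB figure roughly reproduces, not the model's actual config; swap in the real values from its config.json.

```python
# Rough VRAM arithmetic: weight footprint at different quant levels, plus the
# KV cache cost of the context window. All architecture numbers below are
# placeholders -- substitute the real values from this model's config.json.

GIB = 2**30

def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB at a given effective bits-per-weight."""
    return n_params * bits_per_weight / 8 / GIB

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache footprint in GiB (keys + values, fp16/bf16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / GIB

if __name__ == "__main__":
    params = 150e9                       # assumed parameter count (placeholder)
    layers, kv_heads, hdim = 80, 8, 128  # assumed GQA layout (placeholder)

    for bpw in (4.5, 4.0, 3.5):          # ~Q4_K_M down to a more aggressive ~3.5 bpw
        print(f"{bpw:.1f} bpw -> weights ~{weights_gib(params, bpw):.0f} GiB")

    for ctx in (32 * 1024, 256 * 1024):  # a practical window vs. the full 256k
        print(f"KV cache at {ctx // 1024}k context: "
              f"~{kv_cache_gib(layers, kv_heads, hdim, ctx):.0f} GiB")
```

With these assumed numbers, dropping from ~4.5 to ~3.5 effective bits saves on the order of 15-20 GiB of weights, but an fp16 KV cache at the full 256k window would still add tens of GiB on top, so a single-GPU setup would likely also mean running a shorter context or quantizing the KV cache.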