MiniMax M2 REAP + Unsloth 1/1.58/2/4-bit dynamic GGUFs targeting compute-constrained context lengths
The idea is to optimize around the context lengths actually reachable on 24 GB RTX 4090s and the 32 GB RTX 5090: for each x-bit quantization, calibrate on data at roughly that quant's maximum context length on each card. For example, maybe a 1-bit GGUF runs at ~32k context on a 4090 and ~65k on a 5090 (totally arbitrary, optimistic guesses). So while quantizing, you calibrate on high-quality data that sits as close to those two ceilings as possible, while pushing for consistent relevance across the whole context relative to the overall prompt intent: tokens 4k-20k should matter as much (measured by attention amplitudes, probably) as tokens 24k-32k, or whatever you decide is the best way to measure attention consistency across free-form contexts. There are presumably lots of bucketing techniques, some optimized for short contexts, some matching the long-tailed length distributions that long-context models expect, and some that target long context consistently; I sketch what I mean below. I'd prefer something along those lines if the idea interests you, but of course do whatever you think serves your end users better given your own serving architecture.

If you did go this way, maybe there's a second trick where REAP extrapolation meets quantization-level awareness: expert-specific reaping / quantization / parameter blurring / downscaling that could boost performance even further, i.e. some form of context-length-aware, quantization-aware REAP.

If you don't think this is a reasonable request, or won't get to it, could you point me (with minimal link-chasing) at the paper, the repo, and any other links you think would be particularly useful for doing this myself on minimal cloud compute? A cloud provider recommendation would be nice too, though I'll probably shop around if it ends up on me. Summed up, it's compute-constrained, max-context-aware, REAP-aware quantization at the end of the pipeline, and context-length- and position-aware REAP before the quantization steps.

I'm decently familiar with the Unsloth ecosystem and fairly confident I can work out the quantization portion of the steps on my own, including steering the dynamic-quantization paper toward long-context optimization and consistency, or adapting whatever in-ecosystem public releases of the quantization method exist toward context-tied optimization, depending on the level of implementation (if it's all Triton kernels, well, I guess it's a good time to learn those too). So really I'd just need input from you on tidying the REAP repo for REAPing the new MiniMax model, plus some insight into how to push it toward long contexts more forcibly. Rough sketches of what I have in mind follow.
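To make the calibration-data idea concrete, here's a minimal sketch of what I mean by "calibrate at the per-card context ceiling": pack long documents up to a per-quant/per-card token budget and keep a set of position-bucket edges around for the importance statistics later. Everything here is a placeholder: the budgets are guesses rather than measured KV-cache limits, the MiniMax tokenizer repo name is assumed, and the corpus is just an example.

```python
# Sketch: build long-context calibration samples near each card's assumed context ceiling.
# The token budgets below are illustrative guesses, not benchmarked limits.
from datasets import load_dataset
from transformers import AutoTokenizer

CONTEXT_BUDGETS = {          # hypothetical (quant, card) -> max calibration length
    ("iq1_s", "rtx4090_24gb"): 32_768,
    ("iq1_s", "rtx5090_32gb"): 65_536,
}
BUCKET_EDGES = [0, 4_096, 20_480, 32_768]  # example position buckets from the post

tok = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-M2")  # repo name assumed; any long-context tokenizer works
stream = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                      split="train", streaming=True)         # example corpus only

def pack_samples(target_len: int, n_samples: int):
    """Concatenate documents until each calibration sample fills ~target_len tokens."""
    buf, out = [], []
    for doc in stream:
        buf.extend(tok(doc["text"], add_special_tokens=False)["input_ids"])
        while len(buf) >= target_len:
            out.append(buf[:target_len])
            buf = buf[target_len:]
            if len(out) >= n_samples:
                return out
    return out

samples = pack_samples(CONTEXT_BUDGETS[("iq1_s", "rtx4090_24gb")], n_samples=8)
```

The quant names and budgets are stand-ins; the real numbers would come from actually measuring KV-cache headroom per quant per card.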
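For the "tokens 4k-20k matter as much as 24k-32k" point, one crude way to check consistency is to look at how much attention mass each position bucket receives, averaged over layers, heads, and queries. A sketch along those lines, assuming a small proxy model and modest lengths, since `output_attentions=True` materializes full seq x seq maps (for a real 32k+ window you'd want a FlashAttention-friendly proxy instead). The input file is a placeholder.

```python
# Sketch: measure how much attention mass each position bucket receives on a long prompt.
# Full attention maps are O(seq^2) memory, so this is only practical at modest lengths.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"          # small proxy model for illustration
BUCKET_EDGES = [0, 1_024, 2_048, 4_096]       # scale these up for real runs

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager",
                                             torch_dtype=torch.float16, device_map="auto")

ids = tok(open("long_calibration_doc.txt").read(), return_tensors="pt",
          truncation=True, max_length=BUCKET_EDGES[-1]).input_ids.to(model.device)

with torch.no_grad():
    attn = model(ids, output_attentions=True).attentions   # tuple of (1, heads, seq, seq)

# Attention received per key position, averaged over layers, heads, and query positions.
received = torch.stack([a.float().mean(dim=(0, 1, 2)) for a in attn]).mean(dim=0)

for lo, hi in zip(BUCKET_EDGES[:-1], BUCKET_EDGES[1:]):
    hi = min(hi, received.shape[0])
    print(f"positions {lo:>6}-{hi:>6}: mean attention received {received[lo:hi].mean():.2e}")
```

One caveat: causal masking means early positions are visible to more queries and so receive more raw mass, so you'd want to normalize by the number of queries that can see each position before comparing buckets.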
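And for the "context-length-aware REAP" idea, what I'm picturing is roughly: run the long-context calibration samples through the unpruned MoE, accumulate a router-weighted saliency per expert (my loose paraphrase of a REAP-style score, not the official Cerebras implementation), weight tokens by position bucket so far-context tokens count as much as early ones, then drop the lowest-saliency experts per layer. The router output signature and `num_experts` attribute below are assumptions; the real hook points depend on how MiniMax M2's MoE blocks are laid out in the modeling code, and you'd register the hook on each layer's gate module with `register_forward_hook`.

```python
# Sketch: position-weighted, router-weighted expert saliency for a generic MoE layer.
# Loose paraphrase of a REAP-style criterion; not Cerebras' implementation.
from collections import defaultdict
import torch

saliency = defaultdict(lambda: None)   # layer_idx -> tensor [num_experts]

def make_hook(layer_idx, position_weights):
    def hook(module, inputs, output):
        # ASSUMPTION: the router returns (routing_weights, selected_experts), each
        # shaped [num_tokens, top_k], and batch size is 1 so num_tokens == seq_len.
        routing_weights, selected_experts = output
        n_tok = routing_weights.shape[0]
        w = position_weights[:n_tok].to(routing_weights.device)       # per-token weight
        scores = torch.zeros(module.num_experts, device=routing_weights.device)
        scores.scatter_add_(0, selected_experts.flatten(),
                            (routing_weights.float() * w[:, None]).flatten())
        prev = saliency[layer_idx]
        saliency[layer_idx] = scores if prev is None else prev + scores
    return hook

def position_weights(seq_len, bucket_edges):
    """Upweight tokens so each position bucket contributes equally overall."""
    w = torch.ones(seq_len)
    for lo, hi in zip(bucket_edges[:-1], bucket_edges[1:]):
        hi = min(hi, seq_len)
        if hi > lo:
            w[lo:hi] = seq_len / (len(bucket_edges) - 1) / (hi - lo)
    return w

# After accumulating over the calibration set, keep the top experts per layer:
def experts_to_keep(layer_scores: torch.Tensor, keep_ratio: float = 0.75):
    k = max(1, int(layer_scores.numel() * keep_ratio))
    return torch.topk(layer_scores, k).indices.sort().values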
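The "REAP meets quantization-level awareness" half could then just be a mapping from those same saliency scores to per-expert bit-widths instead of (or on top of) outright pruning. Purely conceptual, with arbitrary tiers:

```python
# Sketch: assign per-expert bit-widths from the same saliency scores instead of
# (or in addition to) pruning. The bit ladder and tiering are arbitrary placeholders.
def expert_bit_plan(layer_scores, bit_ladder=(1.58, 2.0, 4.0)):
    """Low-saliency experts get the most aggressive quant, high-saliency the least."""
    ranks = layer_scores.argsort().argsort()            # 0 = least salient
    tier = (ranks.float() / len(ranks) * len(bit_ladder)).long().clamp(max=len(bit_ladder) - 1)
    return {int(e): bit_ladder[int(t)] for e, t in enumerate(tier)}
```

Whether the GGUF tooling can cleanly express per-expert precision is a separate question; I mostly mean it as an importance signal you could feed into whatever mixed-precision machinery you already have.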
Side note: I'm curious, from the Cerebras side of things, how has adapting the new ~linear-inference models like DeepSeek-V3.2-Exp or Kimi Linear been going? And how are you thinking, more generally, about hardware geared specifically toward those architectures for efficient massive context lengths in the near term? If we start getting 100-million-token context lengths next year or the year after, even 1,000 tokens/second works out to roughly a day per prompt (quick arithmetic below), and based on current scaling you'd be lucky to even hold 1,000 tokens/second at those lengths. Personally I suspect some form of dynamic/predictive prompt caching combined with speculative decoding alleviates the slowdown on large context inputs, but if the models can produce outputs on that scale too, then the speed just doesn't cut it anymore. Also, what would we even be making that is 100M text tokens? And none of this considers visual tokens for text embedding / context compression like DeepSeek-OCR, which may or may not tie into the long-context slowdown scaling through diffusion models for the dynamic prompt caching; who's really to say.

I just make predictions, like minimum data for pretraining scaling down massively thanks to a lottery-ticket-hypothesis-style idea of "lottery tokens/samples" that produce outsized jumps across latent space (think intelligently targeted quantum tunneling through hyperdimensional space to dodge the curse of dimensionality), which in turn enables larger-scale auto-experimenting with massive ablation sets. That prediction was December 2023, I think, and it was paired with agentic interfaces (AI UIs) for synthetic dataset creation, which I still haven't seen many of. I.e., design the UI for automatic lottery-token creation, then meta-optimize the synthetic-data-generating model's training process based on how much each sample shifts the total data needed, and you have a way to optimize toward the minimum data required for massive pretraining studies.

I'm already about three steps beyond this in my own studies, but at the same time the REAP work and all the lower-level hardware/architecture work still need doing. If only I had some way of generating training data fast with current models in an iteratively replaceable manner, like a fine-tuning-ready quick-swap inference API or something; maybe Unsloth knows something about this, they seem more on the training side anyway. But with the paradigm coming soon where pretraining will feel instant by today's standards, training and inference will probably feel like the same beast and just converge into some form of continual learning.
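The back-of-the-envelope I mentioned above, pure arithmetic with no hardware assumptions:

```python
# Back-of-the-envelope: time to chew through a prompt at a fixed token rate.
for ctx in (1_000_000, 10_000_000, 100_000_000):
    for tps in (1_000, 10_000, 100_000):
        hours = ctx / tps / 3600
        print(f"{ctx:>11,} tokens @ {tps:>7,} tok/s -> {hours:8.2f} h")
```

So 100M tokens at 1,000 tok/s is about 28 hours per prompt, which is why I think the caching/speculative side matters at least as much as raw throughput.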
TL;DR: I'd be glad to help with building long-context-targeted REAP + quantization sets of the new MiniMax model, and possibly the Qwen3-VL models up to 32B as practice before going all the way up to the full MiniMax model (~230B total parameters, I believe). Also, I might've revealed too many future things; ignore anything that seems more than 6-8 months out, I guess.
I hope this interests you. Happy Halloween Haha.
TimeLord Out.