Does this model support MLA or only the flash version does?


I can't seem to find any info

Yes, only the flash version uses MLA; 4.7 still uses GQA.
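
For context on why that matters for local use: in long chats the KV cache is usually what eats memory, and MLA shrinks it a lot compared to GQA. A back-of-the-envelope sketch below; every dimension is an illustrative placeholder, not this model's actual config:

```python
# Rough KV-cache-per-token comparison, GQA vs MLA.
# All dimensions below are illustrative placeholders, NOT this model's real config.

BYTES = 2              # fp16/bf16
LAYERS = 60            # hypothetical layer count
CONTEXT = 128_000      # tokens of chat history you'd like to keep

# GQA: cache one K and one V vector per KV head per layer.
KV_HEADS = 8
HEAD_DIM = 128
gqa_per_token = 2 * KV_HEADS * HEAD_DIM * BYTES * LAYERS

# MLA: cache a single compressed latent (plus a small RoPE part) per layer.
LATENT_DIM = 512
ROPE_DIM = 64
mla_per_token = (LATENT_DIM + ROPE_DIM) * BYTES * LAYERS

print(f"GQA: {gqa_per_token * CONTEXT / 1e9:.1f} GB for {CONTEXT:,} tokens")
print(f"MLA: {mla_per_token * CONTEXT / 1e9:.1f} GB for {CONTEXT:,} tokens")
```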

Thank you for the reply. That's a shame; I was hoping it would use MLA since I want to run it locally on a Mac.
Is there a way to cut back on the thinking tokens? I love the quality, but the largest chat window I can use without things slowing down too much gets eaten up by thinking blocks.

What Mac do you have? I've run ~70B-parameter models on my M4 Max 16-inch; technically it worked, just not in the way my hopes and dreams envisioned. Honestly, GPU spot instances have been the move. You can snag a B300 for around $1.45/hr depending on demand. Sure, spot instances can get yanked, but in practice it rarely happens, and the cost savings more than make up for the occasional eviction lottery.
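
For what it's worth, here's a rough always-on cost comparison, taking the ~$1.45/hr spot price above as given; the hours, power draw, and electricity price are my own assumptions:

```python
# Back-of-the-envelope monthly cost for running an assistant 24/7.
# The $1.45/hr spot price is from the post above; everything else
# (hours online, Mac power draw, electricity price) is an assumption.
SPOT_PRICE_PER_HR = 1.45      # USD/hr, quoted B300 spot price
HOURS_PER_MONTH = 24 * 30     # always-on, ~720 hours

spot_monthly = SPOT_PRICE_PER_HR * HOURS_PER_MONTH

MAC_WATTS = 200               # assumed Mac power draw under load
KWH_PRICE = 0.30              # assumed USD per kWh
mac_monthly = (MAC_WATTS / 1000) * HOURS_PER_MONTH * KWH_PRICE

print(f"Spot instance, 24/7: ~${spot_monthly:,.0f}/month")
print(f"Local Mac, 24/7:     ~${mac_monthly:,.0f}/month in electricity")
```

By that math, always-on use lands around $1,000/month on spot versus pocket change in electricity locally, so the trade-off mostly depends on whether you need bursts of heavy compute or an always-on box.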

As always, it depends on your use case. If you just want to do some 🤏🤏 slow testing, sure, you can get it running on a Mac. But if you want to actually work, give it some thought, organize a bit, then spin up something with real power under maximum cheap-ass circumstances and make those instances burn. However, as a responsible sysadmin, I have to tell you: you need to secure the instance yourself, because under most legal frameworks you're the one whose neck is on the line. 🪓
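
On the "secure the instance yourself" point, a minimal first check is making sure the inference server isn't reachable from the public internet: bind it to 127.0.0.1 and reach it over an SSH tunnel. A small sketch, run from a machine outside the instance; the IP and port are placeholders:

```python
# Check whether an inference port is reachable from the outside.
# PUBLIC_IP and PORT are placeholders; substitute your instance's values.
# Run this from a machine OUTSIDE the instance's network.
import socket

PUBLIC_IP = "203.0.113.10"   # hypothetical public address of the instance
PORT = 8000                  # hypothetical inference-server port

try:
    with socket.create_connection((PUBLIC_IP, PORT), timeout=3):
        print(f"WARNING: {PUBLIC_IP}:{PORT} is open to the internet.")
        print("Bind the server to 127.0.0.1 and use an SSH tunnel instead.")
except OSError:
    print(f"{PUBLIC_IP}:{PORT} does not appear to be publicly reachable.")
```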

M3 Ultra with 512 GB RAM. Honestly, probably not the usual use case around here, but I just want a stateful buddy for chitchat. I'm planning to use Letta to set it up, so all I need is back-and-forth conversation, no ginormous prompts or anything, and a decent-sized context window, I guess. I'd like to have it up and running 24/7 if possible, which is why I'm trying to do it locally instead of on spot instances. But I'm pretty new to the whole local LLM thing.
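
For a back-and-forth setup like that, most local servers (llama.cpp's server, LM Studio, Ollama, etc.) expose an OpenAI-compatible endpoint that Letta or a plain script can talk to. A minimal sketch of the conversation loop; the port, URL, and model name are assumptions about your setup, and this keeps state only in memory for the session (Letta is what adds the persistent memory):

```python
# Minimal chat loop against a local OpenAI-compatible server on the Mac.
# base_url and model are placeholders for whatever backend you run locally.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

history = [{"role": "system", "content": "You are a friendly chat buddy."}]

while True:
    user_msg = input("you> ")
    if user_msg.strip().lower() in {"quit", "exit"}:
        break
    history.append({"role": "user", "content": user_msg})
    reply = client.chat.completions.create(
        model="local-model",   # placeholder; use the id your server reports
        messages=history,
        max_tokens=512,        # cap output so thinking blocks don't eat the window
    )
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print("buddy>", answer)
```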
