Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
PowerMoE-3B is a 3B-parameter sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token and is trained on a mix of open-source and proprietary datasets. PowerMoE-3B shows promising results compared to dense models with twice as many active parameters across various benchmarks, including natural language multiple-choice, code generation, and math reasoning. Paper: https://arxiv.org/abs/2408.13359
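For intuition only, here is a minimal sketch of a power-law learning-rate decay of the general kind the Power scheduler builds on. The constants `a`, `b`, the warmup length, and the exact functional form below are illustrative placeholders, not values or formulas from the paper; see the paper for the actual batch-size- and token-count-agnostic schedule.

```python
# Illustrative power-law learning-rate decay (placeholder constants and form,
# not the paper's exact formulation).
def power_law_lr(tokens_seen: int,
                 max_lr: float = 3e-4,          # assumed peak learning rate
                 a: float = 1.0,                # placeholder scale constant
                 b: float = 0.5,                # placeholder decay exponent
                 warmup_tokens: int = 1_000_000) -> float:
    if tokens_seen < warmup_tokens:
        # linear warmup to the peak learning rate
        return max_lr * tokens_seen / warmup_tokens
    # power-law decay in the number of trained tokens, capped at the peak
    return min(max_lr, a * tokens_seen ** (-b))

for t in (500_000, 10_000_000, 1_000_000_000):
    print(t, power_law_lr(t))
```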
This is a GGUF quantized version. It requires the latest llama.cpp to run.
Here is a simple example of how to use the PowerMoE GGUF:
```bash
./llama-cli -m PowerMoE4x800M_q3km.gguf -p "How about a snack?"
```
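If you prefer Python, the following is a minimal sketch using the llama-cpp-python bindings; it assumes a recent llama-cpp-python release that supports this architecture and that the GGUF file is in the current directory.

```python
# Minimal sketch using llama-cpp-python (assumes a recent release with
# support for this model architecture).
from llama_cpp import Llama

llm = Llama(model_path="PowerMoE4x800M_q3km.gguf", n_ctx=2048)

output = llm("How about a snack?", max_tokens=64)
print(output["choices"][0]["text"])
```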
Base model: ibm-research/PowerMoE-3b
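If you want the original, unquantized weights instead, a sketch along these lines should work with the transformers library (assuming a release recent enough to include this architecture):

```python
# Sketch: load the original (non-GGUF) base model with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-research/PowerMoE-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("How about a snack?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```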