EfficientMoE
🤗 HuggingFace | 📄 Tech Report
We release MoE Transformers that can be applied to both latent- and pixel-space diffusion frameworks, employing DeepSeek-style expert modules, alternative intermediate widths, varying expert counts, and enhanced attention positional encodings. The models have already been released on Hugging Face.
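To make the architecture concrete, below is a minimal sketch (not the released implementation) of a DeepSeek-style MoE feed-forward block: one always-active shared expert plus top-k routed experts selected by a learned router, standing in for the dense MLP of a diffusion Transformer block. The class names, the SwiGLU expert layout, the hidden width, the expert count, and the top-k value are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small SwiGLU feed-forward expert; the intermediate width is a tunable knob."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class DeepSeekStyleMoE(nn.Module):
    """Shared expert + top-k routed experts, replacing the dense MLP in a DiT block (illustrative sketch)."""
    def __init__(self, dim: int = 384, hidden: int = 256, num_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.shared = Expert(dim, hidden)                      # always active for every token
        self.experts = nn.ModuleList(Expert(dim, hidden) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts, bias=False)  # token -> expert affinity logits
        self.top_k = top_k

    def forward(self, x):                                      # x: (batch, tokens, dim)
        b, t, d = x.shape
        tokens = x.reshape(-1, d)                              # route each token independently
        probs = self.router(tokens).softmax(dim=-1)
        gates, idx = probs.topk(self.top_k, dim=-1)            # (N, top_k) gate weights / expert ids
        gates = gates / gates.sum(dim=-1, keepdim=True)        # renormalize over the chosen experts
        out = self.shared(tokens)                              # shared-expert path
        for e, expert in enumerate(self.experts):              # routed-expert paths
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                out = out.index_add(0, token_ids,
                                    gates[token_ids, slot, None] * expert(tokens[token_ids]))
        return out.reshape(b, t, d)


if __name__ == "__main__":
    moe = DeepSeekStyleMoE()
    demo = torch.randn(2, 16, 384)                             # (batch, tokens, dim)
    print(moe(demo).shape)                                     # torch.Size([2, 16, 384])
```

Varying `num_experts` and the expert hidden width corresponds roughly to the E16/E48 and intermediate-width variants in the tables below; since only the shared expert and the top-k routed experts run per token, the activated-parameter counts stay close to those of the dense baselines.
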
| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|---|---|---|---|
| DiffMoE-S-E16 | 32M | 41.02 | 37.53 |
| DSMoE-S-E16 | 33M | 39.84 | 38.63 |
| DSMoE-S-E48 | 30M | 40.20 | 38.09 |
| DiffMoE-B-E16 | 130M | 20.83 | 70.26 |
| DSMoE-B-E16 | 132M | 20.33 | 71.42 |
| DSMoE-B-E48 | 118M | 19.46 | 72.69 |
| DiffMoE-L-E16 | 458M | 11.16 (14.41*) | 107.74 (88.19*) |
| DSMoE-L-E16 | 465M | 9.80 | 115.45 |
| DSMoE-L-E48 | 436M | 9.19 | 118.52 |
| DSMoE-3B-E16 | 965M | 7.52 | 135.29 |

| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|---|---|---|---|
| DiffMoE-S-E16 | 32M | 15.47 | 94.04 |
| DSMoE-S-E16 | 33M | 14.53 | 97.55 |
| DSMoE-S-E48 | 30M | 14.81 | 96.51 |
| DiffMoE-B-E16 | 130M | 4.87 | 183.43 |
| DSMoE-B-E16 | 132M | 4.50 | 186.79 |
| DSMoE-B-E48 | 118M | 4.27 | 191.03 |
| DiffMoE-L-E16 | 458M | 2.84 | 256.57 |
| DSMoE-L-E16 | 465M | 2.59 | 272.55 |
| DSMoE-L-E48 | 436M | 2.55 | 278.35 |
| DSMoE-3B-E16 | 965M | 2.38 | 304.93 |

| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|---|---|---|---|
| JiT-B/16 | 131M | 4.81 (4.37*) | 222.32 (-) |
| JiTMoE-B/16-E16 | 133M | 4.23 | 245.53 |
| JiT-L/16 | 459M | 3.19 (2.79*) | 309.72 (-) |
| JiTMoE-L/16-E16 | 465M | 3.10 | 311.34 |
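
The released checkpoints can be pulled with the standard `huggingface_hub` client; the repository id in the sketch below is a placeholder, so substitute the actual model id from the collection.

```python
# Minimal sketch of downloading one of the released checkpoints.
# The repo id is a placeholder; replace it with the actual model id from the collection.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<org>/DSMoE-L-E48")  # placeholder repo id
print("Checkpoint downloaded to:", local_dir)
```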
@article{liu2025efficient,
title={Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe},
author={Liu, Yahui and Yue, Yang and Zhang, Jingyuan and Sun, Chenxi and Zhou, Yang and Zeng, Wencong and Tang, Ruiming and Zhou, Guorui},
journal={arXiv preprint arXiv:2512.01252},
year={2025}
}