arxiv:2512.12880

Improving Recursive Transformers with Mixture of LoRAs

Published on Dec 14 · Submitted by Omid Rohanian on Dec 19

Abstract

Mixture of LoRAs within a shared feed-forward network restores expressivity in parameter-shared recursive transformers, achieving state-of-the-art performance with compact models.

AI-generated summary

Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M--120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.
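To make the mechanism concrete, below is a minimal sketch of a shared FFN augmented with Mixture of LoRAs. The details are assumptions for illustration rather than the paper's exact design: top-1 token routing, LoRA deltas applied only to the GeGLU up-projection, and arbitrary choices of rank and expert count. The key idea it shows is that the backbone FFN weights stay tied across recursion depths while each token receives its own low-rank, router-selected modulation.

```python
# Minimal sketch of a Mixture-of-LoRAs (MoL) layer inside a shared FFN.
# Assumptions (not taken from the paper): top-1 token routing, LoRA applied
# only to the FFN up-projection, plain GeGLU; rank, expert count, and
# placement are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoLSharedFFN(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, n_experts=4, rank=8):
        super().__init__()
        # Shared (tied) FFN weights, reused at every recursion depth.
        self.w_in = nn.Linear(d_model, 2 * d_ff)   # GeGLU: value and gate
        self.w_out = nn.Linear(d_ff, d_model)
        # LoRA experts: low-rank deltas on the up-projection.
        self.lora_A = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(n_experts, rank, 2 * d_ff))
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                              # x: (batch, seq, d_model)
        # Token-conditional expert choice (top-1 for simplicity).
        probs = F.softmax(self.router(x), dim=-1)      # (B, S, E)
        idx = probs.argmax(dim=-1)                     # (B, S)
        gate = probs.gather(-1, idx.unsqueeze(-1))     # keeps the router differentiable
        A = self.lora_A[idx]                           # (B, S, d_model, rank)
        B = self.lora_B[idx]                           # (B, S, rank, 2*d_ff)
        # Low-rank, token-specific modulation of the shared up-projection.
        delta = torch.einsum("bsd,bsdr,bsrf->bsf", x, A, B)
        h = self.w_in(x) + gate * delta
        v, g = h.chunk(2, dim=-1)                      # GeGLU activation
        return self.w_out(v * F.gelu(g))
```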

Community

Recursive transformers cut model size by sharing parameters across layers, but this sharing tends to collapse layer-wise expressivity and makes the model less flexible. We propose Mixture of LoRAs (MoL), a lightweight fix that replaces selected shared FFNs with a small set of token-routed LoRA experts (sparse routing), allowing conditional computation while keeping the backbone compact. We pretrain ModernALBERT (50M to 120M) with RoPE, GeGLU, FlashAttention, and distillation-based initialisation, and report state-of-the-art results among compact models on GLUE, SQuAD-v2, and BEIR, often surpassing larger fully parameterised baselines. For deployment, we introduce expert merging (including an EMA-based strategy) that compresses MoL into a single adapter at inference, removing routing overhead.
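For the deployment step, here is a minimal sketch of how an expert-merging procedure could collapse the expert bank into a single adapter. The specifics are assumptions, not the paper's method (and not its EMA-based variant): experts are combined by a usage-weighted average of their low-rank updates, and the merged delta is re-factorised with a truncated SVD so the result keeps the same rank.

```python
# Minimal sketch of merging MoL experts into a single adapter for inference.
# Assumptions (not from the paper): usage-weighted averaging of the expert
# deltas followed by a truncated-SVD re-factorisation; the paper's merging
# procedure and its EMA-based variant may differ.
import torch


@torch.no_grad()
def merge_experts(lora_A, lora_B, usage, rank):
    """lora_A: (E, d_in, r), lora_B: (E, r, d_out), usage: (E,) routing frequencies."""
    w = usage / usage.sum()
    # Usage-weighted average of the full low-rank updates.
    delta = torch.einsum("e,eir,ero->io", w, lora_A, lora_B)   # (d_in, d_out)
    # Re-factorise the merged delta into a single rank-`rank` adapter.
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    A_merged = U[:, :rank] * S[:rank]                          # (d_in, rank)
    B_merged = Vh[:rank]                                       # (rank, d_out)
    return A_merged, B_merged
```

At inference, the router and expert bank are dropped and the shared up-projection is simply augmented as `w_in(x) + x @ A_merged @ B_merged`, which removes the routing overhead the summary mentions.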
