arxiv:2512.12880

Improving Recursive Transformers with Mixture of LoRAs

Published on Dec 14 · Submitted by Omid Rohanian on Dec 19

Abstract

Mixture of LoRAs within a shared feed-forward network restores expressivity in parameter-shared recursive transformers, achieving state-of-the-art performance with compact models.

AI-generated summary

Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M--120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.
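To make the mechanism concrete, below is a minimal sketch of a shared FFN augmented with Mixture of LoRAs. The details are assumptions for illustration rather than the paper's exact design: top-1 token routing, LoRA deltas applied only to the GeGLU up-projection, and arbitrary choices of rank and expert count. The key idea it shows is that the backbone FFN weights stay tied across recursion depths while each token receives its own low-rank, router-selected modulation.

```python
# Minimal sketch of a Mixture-of-LoRAs (MoL) layer inside a shared FFN.
# Assumptions (not taken from the paper): top-1 token routing, LoRA applied
# only to the FFN up-projection, plain GeGLU; rank, expert count, and
# placement are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoLSharedFFN(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, n_experts=4, rank=8):
        super().__init__()
        # Shared (tied) FFN weights, reused at every recursion depth.
        self.w_in = nn.Linear(d_model, 2 * d_ff)   # GeGLU: value and gate
        self.w_out = nn.Linear(d_ff, d_model)
        # LoRA experts: low-rank deltas on the up-projection.
        self.lora_A = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(n_experts, rank, 2 * d_ff))
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                              # x: (batch, seq, d_model)
        # Token-conditional expert choice (top-1 for simplicity).
        probs = F.softmax(self.router(x), dim=-1)      # (B, S, E)
        idx = probs.argmax(dim=-1)                     # (B, S)
        gate = probs.gather(-1, idx.unsqueeze(-1))     # keeps the router differentiable
        A = self.lora_A[idx]                           # (B, S, d_model, rank)
        B = self.lora_B[idx]                           # (B, S, rank, 2*d_ff)
        # Low-rank, token-specific modulation of the shared up-projection.
        delta = torch.einsum("bsd,bsdr,bsrf->bsf", x, A, B)
        h = self.w_in(x) + gate * delta
        v, g = h.chunk(2, dim=-1)                      # GeGLU activation
        return self.w_out(v * F.gelu(g))
```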

Community

Recursive transformers cut model size by sharing parameters across layers, but this sharing tends to collapse layer-wise expressivity and makes the model less flexible. We propose Mixture of LoRAs (MoL), a lightweight fix that replaces selected shared FFNs with a small set of token-routed LoRA experts (sparse routing), allowing conditional computation while keeping the backbone compact. We pretrain ModernALBERT (50M to 120M) with RoPE, GeGLU, FlashAttention, and distillation-based initialisation, and report state-of-the-art results among compact models on GLUE, SQuAD-v2, and BEIR, often surpassing larger fully parameterised baselines. For deployment, we introduce expert merging (including an EMA-based strategy) that compresses MoL into a single adapter at inference, removing routing overhead.
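For the deployment step, here is a minimal sketch of how an expert-merging procedure could collapse the expert bank into a single adapter. The specifics are assumptions, not the paper's method (and not its EMA-based variant): experts are combined by a usage-weighted average of their low-rank updates, and the merged delta is re-factorised with a truncated SVD so the result keeps the same rank.

```python
# Minimal sketch of merging MoL experts into a single adapter for inference.
# Assumptions (not from the paper): usage-weighted averaging of the expert
# deltas followed by a truncated-SVD re-factorisation; the paper's merging
# procedure and its EMA-based variant may differ.
import torch


@torch.no_grad()
def merge_experts(lora_A, lora_B, usage, rank):
    """lora_A: (E, d_in, r), lora_B: (E, r, d_out), usage: (E,) routing frequencies."""
    w = usage / usage.sum()
    # Usage-weighted average of the full low-rank updates.
    delta = torch.einsum("e,eir,ero->io", w, lora_A, lora_B)   # (d_in, d_out)
    # Re-factorise the merged delta into a single rank-`rank` adapter.
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    A_merged = U[:, :rank] * S[:rank]                          # (d_in, rank)
    B_merged = Vh[:rank]                                       # (rank, d_out)
    return A_merged, B_merged
```

At inference, the router and expert bank are dropped and the shared up-projection is simply augmented as `w_in(x) + x @ A_merged @ B_merged`, which removes the routing overhead the summary mentions.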
