Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma
Abstract
The Alignment Trilemma in RLHF shows that simultaneously achieving representativeness, tractability, and robustness is computationally infeasible, forcing trade-offs in current implementations.
Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10^3--10^4 samples from homogeneous annotator pools while 10^7--10^8 samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.
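To make the scale of the claimed sample gap concrete, here is a minimal back-of-envelope sketch in Python. The coverage-style bound of roughly 1/epsilon^2 preference samples per value cluster, and the cluster counts themselves, are illustrative assumptions rather than the paper's derivation.

```python
# Back-of-envelope illustration of the representativeness sample gap
# described in the abstract. The coverage bound of ~1/eps^2 comparisons
# per value cluster is an illustrative assumption, not the paper's formula.

def samples_needed(num_value_clusters: int, eps: float, per_cluster_floor: int = 100) -> int:
    """Crude estimate: each value cluster needs ~1/eps^2 preference
    comparisons, with a minimum floor so small clusters are not starved."""
    per_cluster = max(int(1.0 / eps**2), per_cluster_floor)
    return num_value_clusters * per_cluster

# A homogeneous annotator pool covers few value clusters at loose precision.
homogeneous = samples_needed(num_value_clusters=10, eps=0.1)       # ~10^3
# Global representativeness (eps <= 0.01) over many clusters blows this up.
global_pop = samples_needed(num_value_clusters=10_000, eps=0.01)   # ~10^8

print(f"homogeneous pool:  ~{homogeneous:,} preference samples")
print(f"global population: ~{global_pop:,} preference samples")
```

Even under this crude model, tightening epsilon and letting the number of value clusters reflect a global population inflates the required data by four to five orders of magnitude, consistent with the 10^3--10^4 versus 10^7--10^8 gap stated above.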
Community
This paper formalizes an Alignment Trilemma, proving that no RLHF-based alignment strategy can simultaneously achieve ε-representativeness, polynomial-time tractability, and δ-robustness, with any two of these goals implying exponential cost in the third.
➡️ Key Highlights of the Alignment Trilemma Framework:
🧠 Formalization of Alignment Constraints: The paper rigorously defines ε-representativeness (alignment fidelity across diverse human values), polynomial tractability (sample and compute complexity), and δ-robustness (resilience to adversarial perturbations). It proves that satisfying all three simultaneously is impossible for large populations and high-dimensional context spaces, i.e., achieving both small ε and small δ requires Ω(2^{d_context}) operations.
📈 Complexity-Theoretic Lower Bounds on Scalability: The authors show that alignment requires operations scaling as Ω(2^{d_context}), where d_context is the context dimensionality. This implies that as model context spaces or population diversity grow, alignment becomes super-polynomial in cost, rendering naive scaling approaches ineffective for global representational alignment.
⚖️ Practical Trade-off Analysis in Current RLHF Pipelines: The study maps how existing RLHF systems navigate the trilemma: choosing small, homogeneous annotator pools (typically 10^3–10^4 samples) and strong KL penalties to maintain tractability and partial robustness, at the cost of representativeness (see the sketch below). This design leads directly to known pathologies such as sycophancy, reward hacking, and collapse of minority values, which are shown here to be inevitable outcomes of the trilemma constraints.
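As a companion to the trade-off above, here is a minimal sketch of the standard KL-penalized RLHF objective (not code from the paper); the reward statistics and beta values are illustrative assumptions. Larger beta pins the tuned policy to the reference model, which is how current pipelines buy stability and tractability at the cost of representativeness.

```python
import numpy as np

# Minimal sketch (not from the paper) of the standard KL-penalized RLHF
# objective: J = E[r(x, y)] - beta * KL(pi || pi_ref).
# A large beta keeps the tuned policy close to the (majority-shaped)
# reference model, suppressing shifts toward minority preferences.

def kl_penalized_objective(rewards, logp_policy, logp_ref, beta):
    """Per-sample objective used in PPO-style RLHF fine-tuning."""
    kl = logp_policy - logp_ref           # Monte Carlo estimate of the KL term
    return np.mean(rewards - beta * kl)

rng = np.random.default_rng(0)
rewards = rng.normal(1.0, 0.5, size=1024)    # illustrative reward-model scores
logp_ref = rng.normal(-2.0, 0.3, size=1024)  # reference-model log-probs
logp_pol = logp_ref + 0.4                    # policy drifted toward higher reward

for beta in (0.01, 0.1, 1.0):
    obj = kl_penalized_objective(rewards, logp_pol, logp_ref, beta)
    print(f"beta={beta}: objective = {obj:.3f}")
```

The loop shows the objective shrinking as beta grows: past some point the penalty dominates the reward signal, so the cheapest optimum is simply to stay near the reference distribution.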
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference (2025)
- Rectifying Shortcut Behaviors in Preference-based Reward Learning (2025)
- Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration (2025)
- A Granular Study of Safety Pretraining under Model Abliteration (2025)
- Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards (2025)
- Rethinking Deep Alignment Through The Lens Of Incomplete Learning (2025)
- AnaCP: Toward Upper-Bound Continual Learning via Analytic Contrastive Projection (2025)
