arxiv:2511.19504

Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma

Published on Nov 23
Submitted by Aman Chadha on Nov 27

Abstract

The Alignment Trilemma in RLHF shows that simultaneously achieving representativeness, tractability, and robustness is computationally infeasible, forcing trade-offs in current implementations.

AI-generated summary

Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10^3--10^4 samples from homogeneous annotator pools while 10^7--10^8 samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.
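In symbols, the abstract's central claim (a compact paraphrase using its own notation, not the paper's formal theorem statement) reads:

\[
  \underbrace{\varepsilon \le 0.01}_{\text{representativeness}}
  \;\wedge\;
  \underbrace{\delta \le 0.001}_{\text{robustness}}
  \;\Longrightarrow\;
  \#\text{operations} \;=\; \Omega\!\left(2^{\,d_{\text{context}}}\right),
\]

which is super-polynomial in the context dimensionality, so polynomial tractability cannot hold at the same time.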

Community

Paper author and submitter:

This paper formalizes an Alignment Trilemma, proving that no RLHF-based alignment strategy can simultaneously achieve ε-representativeness, polynomial-time tractability, and δ-robustness: satisfying any two of these goals implies exponential cost in the third.

➡️ Key Highlights of the Alignment Trilemma Framework:
🧠 Formalization of Alignment Constraints: The paper rigorously defines ε-representativeness (alignment fidelity across diverse human values), polynomial tractability (bounded sample and compute complexity), and δ-robustness (resilience to adversarial perturbations and distribution shift). It proves that satisfying all three simultaneously is impossible for large populations and high-dimensional context spaces: achieving both small ε and small δ requires Ω(2^{d_context}) operations.

📈 Complexity-Theoretic Lower Bounds on Scalability: The authors show that alignment requires operations scaling as Ω(κ · 2^{d_context} / (ε² n δ)), where d_context ≫ log n. This implies that as model context spaces or population diversity grow, alignment becomes super-polynomial in cost, rendering naive scaling approaches ineffective for global representational alignment (see the numerical sketch after these highlights).

⚖️ Practical Trade-off Analysis in Current RLHF Pipelines: The study maps how existing RLHF systems navigate the trilemma: they choose small, homogeneous annotator pools (typically 10^3–10^4 samples) and strong KL penalties to maintain tractability and partial robustness, at the cost of representativeness. This design leads directly to known pathologies such as sycophancy, reward hacking, and collapse of minority values, which the paper shows to be inevitable outcomes of the trilemma constraints.
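As a rough numerical illustration of the lower bound quoted above, and not code from the paper, the short Python sketch below plugs example values into Ω(κ · 2^{d_context} / (ε² n δ)). The choices of κ and n are arbitrary placeholders and Ω(·) hides constant factors, so only the growth with d_context and with tighter ε and δ is meaningful.

# Illustrative sketch (not from the paper): evaluate the quoted lower bound
# Omega(kappa * 2^d_context / (eps^2 * n * delta)) at example parameter values.
# kappa and n below are arbitrary assumptions, and Omega(.) hides constant
# factors, so the absolute numbers only indicate how quickly the bound grows.

def lower_bound_ops(d_context: int, eps: float, delta: float,
                    n: int = 10_000, kappa: float = 1.0) -> float:
    """Return kappa * 2**d_context / (eps**2 * n * delta), the stated scaling."""
    return kappa * (2.0 ** d_context) / (eps ** 2 * n * delta)

for d in (16, 32, 64, 128):
    strict = lower_bound_ops(d, eps=0.01, delta=0.001)  # abstract's strict targets
    relaxed = lower_bound_ops(d, eps=0.1, delta=0.05)   # loosened targets
    print(f"d_context={d:3d}  strict ~ {strict:.2e} ops  relaxed ~ {relaxed:.2e} ops")

Under these illustrative numbers, even a moderate context dimensionality pushes the strict-target bound far beyond any realistic training budget, which matches the paper's point that practical pipelines relax representativeness rather than pay this cost.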
