When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs
Authors: Richard J. Young, Brandon Gillins, Alice M. Matthews
Affiliation: University of Nevada, Las Vegas
Published: October 18, 2025
arXiv ID: 2510.18892
Paper Abstract
Despite widespread deployment of Large Language Models, systematic evaluation of instruction-following capabilities remains challenging. While comprehensive benchmarks exist, focused assessments that quickly diagnose specific instruction adherence patterns are valuable. As newer models may be trained on existing benchmarks, novel evaluation approaches are needed to assess genuine capabilities rather than memorized performance.
This paper presents a streamlined evaluation framework using twenty carefully designed prompts to assess LLM instruction-following across diverse task categories. We demonstrate this framework through a large-scale empirical study conducted on October 14, 2025, testing 256 verified working models from 331 available via OpenRouter. To ensure methodological rigor and prevent selection bias, we first verified each model's basic functionality before inclusion.
Unlike large-scale benchmarks requiring extensive computational resources, our approach offers a practical diagnostic tool researchers and practitioners can readily apply. Our methodology builds upon verifiable instructions while introducing a compact test suite balancing comprehensiveness with efficiency. Each prompt targets distinct aspects of instruction following, including format compliance, content constraints, logical sequencing, and multi-step task execution.
We evaluate models from major providers (OpenAI, Anthropic, Google, Meta, Mistral) and emerging implementations (Qwen, DeepSeek, community models), providing comparative performance analysis. Our findings reveal consistent failure modes and identify specific instruction types posing particular challenges. This work contributes both a practical evaluation tool and one of the most comprehensive empirical analyses of instruction-following capabilities across the contemporary LLM landscape.
Key Findings
Overall Performance
- 256 models evaluated across 20 diagnostic tests (5,120 total evaluations)
- Overall pass rate: 43.7% (fewer than half of the 5,120 evaluations passed)
- Performance range: 0% to 100% (extreme variation across models)
- Standard deviation: 28.4 percentage points (highly heterogeneous landscape)
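The headline numbers above follow from straightforward aggregation over a 256 x 20 pass/fail matrix. The sketch below is illustrative only (placeholder data, not the authors' released analysis code) and shows how an overall pass rate, per-model score range, and standard deviation of this kind could be computed:

import numpy as np

# Placeholder results matrix: rows = models, columns = the 20 diagnostic tests,
# entries 1 (pass) or 0 (fail); the real study has shape (256, 20) = 5,120 cells.
results = np.random.randint(0, 2, size=(256, 20))

overall_pass_rate = results.mean() * 100   # share of all evaluations passed
per_model = results.mean(axis=1) * 100     # each model's score, 0-100%

print(f"Overall pass rate: {overall_pass_rate:.1f}%")
print(f"Model score range: {per_model.min():.0f}%-{per_model.max():.0f}%")
print(f"Standard deviation: {per_model.std():.1f} percentage points")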
Top Performing Models
- qwen/qwen-plus-2025-07-28:thinking - 100% (20/20 tests passed)
- deepseek/deepseek-r1 - 95% (19/20 tests passed)
- openai/o1 - 95% (19/20 tests passed)
- qwen/qwq-32b-preview - 95% (19/20 tests passed)
- deepseek/deepseek-r1-distill-llama-70b - 90% (18/20 tests passed)
Provider Performance
Average performance by provider (≥3 models):
- x-ai: 79.3% (15 models)
- google: 58.8% (34 models)
- openai: 57.5% (32 models)
- qwen: 54.4% (27 models)
- deepseek: 53.3% (15 models)
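As a rough illustration, a provider-level table like the one above can be derived from per-model scores by grouping on the provider prefix of each model slug. The input file and column names below are assumptions for the sketch, not the paper's released scripts:

import pandas as pd

# Hypothetical input: one row per model with its overall pass rate in percent.
df = pd.read_csv("model_scores.csv")                  # assumed columns: model, score
df["provider"] = df["model"].str.split("/").str[0]    # e.g. "openai/o1" -> "openai"

by_provider = (
    df.groupby("provider")["score"]
      .agg(mean_score="mean", n_models="count")
      .query("n_models >= 3")                         # keep providers with at least 3 models
      .sort_values("mean_score", ascending=False)
)
print(by_provider.round(1))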
Test Difficulty
- Hardest test: Test 5 (Complex String Transformation) - 2.7% pass rate
- Easiest tests: Test 2 (Exact Output Compliance) & Test 15 (Safety Refusal) - 96.1% pass rate
- 93.4 percentage point gap between the easiest and hardest tests; 54.9 points between the easiest and hardest categories (66.9% vs. 12.0%)
Category Performance
- Constraint Compliance: 66.9% (easiest)
- Text Processing: 50.5%
- Structured Data: 41.1%
- Complex Operations: 35.0%
- String Manipulation: 12.0% (hardest)
Methodology
Evaluation Framework
- 20 diagnostic instruction-following prompts covering diverse task types
- Exact-match evaluation (binary pass/fail, no partial credit)
- Whitespace normalization (leading/trailing spaces ignored)
- Case-sensitive where specified
- Format-strict (JSON, tables, special characters must be exact)
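A minimal grading function consistent with these rules might look like the following; the function name and signature are illustrative rather than taken from the released framework:

def grade(response: str, expected: str, case_sensitive: bool = True) -> bool:
    """Binary pass/fail: exact match after stripping leading/trailing whitespace."""
    got, want = response.strip(), expected.strip()
    if not case_sensitive:
        got, want = got.lower(), want.lower()
    return got == want

# Format-strict example: any deviation, even a missing space in JSON, fails.
assert grade('{"answer": 42}', '{"answer": 42}')
assert not grade('{"answer":42}', '{"answer": 42}')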
Model Selection
- 331 models initially available via OpenRouter (October 14, 2025)
- 256 models verified working (77% verification rate)
- Two-stage verification: endpoint availability + basic instruction test
- Diverse providers: OpenAI, Anthropic, Google, Meta, Qwen, DeepSeek, x-ai, and 20+ others
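The two-stage verification could be scripted against OpenRouter's OpenAI-compatible HTTP API roughly as below; the probe prompt, helper names, and environment variable are assumptions for illustration, not the authors' exact code:

import os
import requests

API = "https://openrouter.ai/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

def list_models() -> list[str]:
    """Enumerate every model slug currently listed by OpenRouter."""
    resp = requests.get(f"{API}/models", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return [m["id"] for m in resp.json()["data"]]

def verify(model_id: str) -> bool:
    """Stage 1: the endpoint must respond at all.
    Stage 2: it must pass a trivial instruction before entering the study."""
    try:
        resp = requests.post(
            f"{API}/chat/completions",
            headers=HEADERS,
            json={
                "model": model_id,
                "messages": [{"role": "user", "content": "Reply with exactly: OK"}],
                "max_tokens": 10,
            },
            timeout=60,
        )
        resp.raise_for_status()
        answer = resp.json()["choices"][0]["message"]["content"]
        return answer.strip() == "OK"
    except (requests.RequestException, KeyError, IndexError):
        return False

working = [m for m in list_models() if verify(m)]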
Test Categories
- String Manipulation (5 tests) - Multi-step text transformations
- Constraint Compliance (3 tests) - Exact output specifications
- Text Processing (1 test) - Targeted text manipulation
- Structured Data (5 tests) - JSON, Markdown, CSV generation
- Complex Operations (6 tests) - Multi-step reasoning and computation
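One way to represent such a frozen test bank is a small typed record per prompt; the two entries below are invented stand-ins, not the paper's actual test items:

from dataclasses import dataclass

@dataclass(frozen=True)
class DiagnosticTest:
    test_id: int
    category: str            # one of the five categories above
    prompt: str              # instruction sent verbatim to the model
    expected: str            # exact string the grader compares against
    case_sensitive: bool = True

TEST_BANK = [
    DiagnosticTest(2, "Constraint Compliance",
                   "Respond with the single word: BANANA", "BANANA"),
    DiagnosticTest(5, "String Manipulation",
                   "Reverse each word of 'hello world', then uppercase the result.",
                   "OLLEH DLROW"),
]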
Figures
The paper includes four publication-quality figures:
Figure 1: Performance Heatmap
Performance matrix showing top 50 models across all 20 tests (green=pass, red=fail). Reveals patterns in which models excel at which instruction types.
Figure 2: Provider Comparison
Average performance by provider (minimum 3 models). Shows x-ai leading at 79.3%, with substantial variation across providers (46 percentage point spread).
Figure 3: Test Difficulty
Pass rates for all 20 tests, color-coded by difficulty level. Highlights extreme variation from 2.7% (Test 5) to 96.1% (Tests 2 & 15).
Figure 4: Category Performance
Average pass rates across five test categories. String Manipulation (12.0%) proves most challenging, while Constraint Compliance (66.9%) is easiest.
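A Figure-1-style pass/fail heatmap can be rebuilt from the released results with a few lines of matplotlib; the snippet below uses placeholder data and labels rather than the paper's plotting scripts:

import matplotlib.pyplot as plt
import numpy as np

results = np.random.randint(0, 2, size=(50, 20))   # 1 = pass (green), 0 = fail (red)
fig, ax = plt.subplots(figsize=(10, 12))
ax.imshow(results, cmap="RdYlGn", aspect="auto", vmin=0, vmax=1)
ax.set_xticks(range(20))
ax.set_xticklabels([f"T{i + 1}" for i in range(20)])
ax.set_xlabel("Diagnostic test")
ax.set_ylabel("Model (top 50 by overall score)")
ax.set_title("Pass/fail matrix, top 50 models x 20 tests")
fig.tight_layout()
fig.savefig("performance_heatmap.pdf")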
Citation
If you use this work in your research, please cite:
@article{young2025instruction,
  title={When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs},
  author={Young, Richard J. and Gillins, Brandon and Matthews, Alice M.},
  journal={arXiv preprint arXiv:2510.18892},
  year={2025},
  url={http://arxiv.org/abs/2510.18892}
}
Resources
Paper & Data
- arXiv Paper: http://arxiv.org/abs/2510.18892
- Full Dataset: https://huggingface.co/datasets/richardyoung/llm-instruction-following-eval
- Complete evaluation results (5,120 evaluations)
- Excel workbook with multiple analysis sheets
- JSON export with metadata
- Publication-quality PDF figures
Code & Reproducibility
- Code Repository: https://huggingface.co/richardyoung/llm-instruction-following-code
- Complete evaluation framework (Python)
- All 20 diagnostic test prompts
- Analysis and visualization scripts
- Model configuration (256 verified models)
- Requirements and setup instructions
Paper Details
Structure
- 21 pages (excluding references)
- 4 figures (publication-quality PDFs)
- 5 tables (model rankings, category performance, provider comparisons)
- 27 bibliography entries (23 cited)
Sections
- Introduction - Motivation and research questions
- Related Work - Comprehensive review of instruction-following benchmarks
- Methodology - Evaluation framework, test design, model selection
- Results - Empirical findings across 256 models
- Discussion - Implications, failure modes, future work
- Conclusion - Summary and contributions
Related Work Coverage
Comprehensive citation of instruction-following research:
- Foundational: IFEval, Self-Instruct
- SOTA Benchmarks: InFoBench (DRFR), FollowBench, ComplexBench
- Specialized: SIFo, EifBench, CELLO, LLMBar, RewardBench
- Evaluation: Prometheus 2, PandaLM, Auto-J
- Data Generation: WizardLM/Evol-Instruct, Magpie, Instruction Backtranslation
- Domain Extensions: InfoSearch, FollowIR, InstructIR
- Efficiency: tinyBenchmarks, REIFE
Authors
Richard J. Young
Email: ryoung@unlv.edu
Role: Principal Investigator, lead author
Affiliation: University of Nevada, Las Vegas
Brandon Gillins
Email: bgillins@unlv.edu
Role: Co-author, technical lead
Affiliation: University of Nevada, Las Vegas
Alice M. Matthews
Email: amatthews@unlv.edu
Role: Co-author, data analysis
Affiliation: University of Nevada, Las Vegas
Impact & Contributions
Practical Contributions
- Diagnostic Tool - Compact 20-test suite for quick assessment
- Empirical Data - 5,120 evaluations across 256 models (one of the largest studies of its kind)
- Open Resources - Complete dataset, code, and evaluation framework
- Reproducibility - Frozen test bank, model snapshot, exact methodology
Research Contributions
- Failure Mode Identification - Consistent patterns in instruction-following failures
- Category Difficulty Analysis - String manipulation significantly harder than constraint compliance
- Provider Comparisons - Systematic differences across training methodologies
- Benchmark Design - Exact-match evaluation for objectivity and reproducibility
Community Value
- Researchers can quickly assess new models
- Practitioners can identify instruction-following strengths/weaknesses
- Developers can diagnose specific failure modes
- Community can extend with new diagnostic tests
Key Insights
1. Extreme Performance Variability
Even among recent models, instruction-following capabilities vary dramatically (0-100% pass rates). This suggests fundamental differences in training approaches and architectural choices.
2. Category-Specific Challenges
Models show highly uneven capabilities across instruction types. A model performing well on constraint compliance may fail completely on string manipulation.
3. The "Impossibly Hard Test"
Test 5 (Complex String Transformation) achieved only a 2.7% pass rate (7 of 256 models). Even GPT-4 and Claude-3 frequently fail it, suggesting certain instruction patterns remain extremely challenging.
4. Provider-Level Patterns
Providers differ substantially in average performance (33.3% to 79.3%), indicating that training methodologies significantly impact instruction adherence beyond model size or architecture.
5. Exact Match Feasibility
Despite challenges, 12 models achieved ≥85% pass rates, demonstrating that exact instruction following is possible with appropriate training and architecture.
License
This paper is released under CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International).
- Share and adapt with attribution
- Use for research and educational purposes
- Commercial use prohibited
- Share adaptations under the same license
Acknowledgments
We thank:
- OpenRouter for unified API access to 256+ models
- Model Providers (OpenAI, Anthropic, Google, Meta, Qwen, DeepSeek, x-ai, and others)
- Instruction-Following Research Community for foundational benchmarks
- University of Nevada, Las Vegas for research support
Paper Version: 1.0
Publication Date: October 18, 2025
arXiv ID: 2510.18892
Last Updated: October 23, 2025