When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs
Authors: Richard J. Young, Brandon Gillins, Alice M. Matthews
Affiliation: University of Nevada, Las Vegas
Published: October 18, 2025
arXiv ID: 2510.18892
Paper Abstract
Despite widespread deployment of Large Language Models, systematic evaluation of instruction-following capabilities remains challenging. While comprehensive benchmarks exist, focused assessments that quickly diagnose specific instruction adherence patterns are valuable. As newer models may be trained on existing benchmarks, novel evaluation approaches are needed to assess genuine capabilities rather than memorized performance.
This paper presents a streamlined evaluation framework using twenty carefully designed prompts to assess LLM instruction-following across diverse task categories. We demonstrate this framework through a large-scale empirical study conducted on October 14, 2025, testing 256 verified working models from 331 available via OpenRouter. To ensure methodological rigor and prevent selection bias, we first verified each model's basic functionality before inclusion.
Unlike large-scale benchmarks requiring extensive computational resources, our approach offers a practical diagnostic tool researchers and practitioners can readily apply. Our methodology builds upon verifiable instructions while introducing a compact test suite balancing comprehensiveness with efficiency. Each prompt targets distinct aspects of instruction following, including format compliance, content constraints, logical sequencing, and multi-step task execution.
We evaluate models from major providers (OpenAI, Anthropic, Google, Meta, Mistral) and emerging implementations (Qwen, DeepSeek, community models), providing comparative performance analysis. Our findings reveal consistent failure modes and identify specific instruction types posing particular challenges. This work contributes both a practical evaluation tool and one of the most comprehensive empirical analyses of instruction-following capabilities across the contemporary LLM landscape.
Key Findings
Overall Performance
- 256 models evaluated across 20 diagnostic tests (5,120 total evaluations)
- Overall pass rate: 43.7% (fewer than half of the 5,120 evaluations passed)
- Performance range: 0% to 100% (extreme variation across models)
- Standard deviation: 28.4 percentage points (highly heterogeneous landscape)
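The headline numbers above follow from straightforward aggregation over a 256 x 20 pass/fail matrix. The sketch below is illustrative only (placeholder data, not the authors' released analysis code) and shows how an overall pass rate, per-model score range, and standard deviation of this kind could be computed:

import numpy as np

# Placeholder results matrix: rows = models, columns = the 20 diagnostic tests,
# entries 1 (pass) or 0 (fail); the real study has shape (256, 20) = 5,120 cells.
results = np.random.randint(0, 2, size=(256, 20))

overall_pass_rate = results.mean() * 100   # share of all evaluations passed
per_model = results.mean(axis=1) * 100     # each model's score, 0-100%

print(f"Overall pass rate: {overall_pass_rate:.1f}%")
print(f"Model score range: {per_model.min():.0f}%-{per_model.max():.0f}%")
print(f"Standard deviation: {per_model.std():.1f} percentage points")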
Top Performing Models
- qwen/qwen-plus-2025-07-28:thinking - 100% (20/20 tests passed)
- deepseek/deepseek-r1 - 95% (19/20 tests passed)
- openai/o1 - 95% (19/20 tests passed)
- qwen/qwq-32b-preview - 95% (19/20 tests passed)
- deepseek/deepseek-r1-distill-llama-70b - 90% (18/20 tests passed)
Provider Performance
Average performance by provider (≥3 models):
- x-ai: 79.3% (15 models)
- google: 58.8% (34 models)
- openai: 57.5% (32 models)
- qwen: 54.4% (27 models)
- deepseek: 53.3% (15 models)
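As a rough illustration, a provider-level table like the one above can be derived from per-model scores by grouping on the provider prefix of each model slug. The input file and column names below are assumptions for the sketch, not the paper's released scripts:

import pandas as pd

# Hypothetical input: one row per model with its overall pass rate in percent.
df = pd.read_csv("model_scores.csv")                  # assumed columns: model, score
df["provider"] = df["model"].str.split("/").str[0]    # e.g. "openai/o1" -> "openai"

by_provider = (
    df.groupby("provider")["score"]
      .agg(mean_score="mean", n_models="count")
      .query("n_models >= 3")                         # keep providers with at least 3 models
      .sort_values("mean_score", ascending=False)
)
print(by_provider.round(1))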
Test Difficulty
- Hardest test: Test 5 (Complex String Transformation) - 2.7% pass rate
- Easiest tests: Test 2 (Exact Output Compliance) & Test 15 (Safety Refusal) - 96.1% pass rate
- 93.4 percentage point gap between the easiest and hardest tests; 54.9 points between the easiest and hardest categories (66.9% vs. 12.0%)
Category Performance
- Constraint Compliance: 66.9% (easiest)
- Text Processing: 50.5%
- Structured Data: 41.1%
- Complex Operations: 35.0%
- String Manipulation: 12.0% (hardest)
Methodology
Evaluation Framework
- 20 diagnostic instruction-following prompts covering diverse task types
- Exact-match evaluation (binary pass/fail, no partial credit)
- Whitespace normalization (leading/trailing spaces ignored)
- Case-sensitive where specified
- Format-strict (JSON, tables, special characters must be exact)
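A minimal grading function consistent with these rules might look like the following; the function name and signature are illustrative rather than taken from the released framework:

def grade(response: str, expected: str, case_sensitive: bool = True) -> bool:
    """Binary pass/fail: exact match after stripping leading/trailing whitespace."""
    got, want = response.strip(), expected.strip()
    if not case_sensitive:
        got, want = got.lower(), want.lower()
    return got == want

# Format-strict example: any deviation, even a missing space in JSON, fails.
assert grade('{"answer": 42}', '{"answer": 42}')
assert not grade('{"answer":42}', '{"answer": 42}')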
Model Selection
- 331 models initially available via OpenRouter (October 14, 2025)
- 256 models verified working (77% verification rate)
- Two-stage verification: endpoint availability + basic instruction test
- Diverse providers: OpenAI, Anthropic, Google, Meta, Qwen, DeepSeek, x-ai, and 20+ others
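The two-stage verification could be scripted against OpenRouter's OpenAI-compatible HTTP API roughly as below; the probe prompt, helper names, and environment variable are assumptions for illustration, not the authors' exact code:

import os
import requests

API = "https://openrouter.ai/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

def list_models() -> list[str]:
    """Enumerate every model slug currently listed by OpenRouter."""
    resp = requests.get(f"{API}/models", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return [m["id"] for m in resp.json()["data"]]

def verify(model_id: str) -> bool:
    """Stage 1: the endpoint must respond at all.
    Stage 2: it must pass a trivial instruction before entering the study."""
    try:
        resp = requests.post(
            f"{API}/chat/completions",
            headers=HEADERS,
            json={
                "model": model_id,
                "messages": [{"role": "user", "content": "Reply with exactly: OK"}],
                "max_tokens": 10,
            },
            timeout=60,
        )
        resp.raise_for_status()
        answer = resp.json()["choices"][0]["message"]["content"]
        return answer.strip() == "OK"
    except (requests.RequestException, KeyError, IndexError):
        return False

working = [m for m in list_models() if verify(m)]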
Test Categories
- String Manipulation (5 tests) - Multi-step text transformations
- Constraint Compliance (3 tests) - Exact output specifications
- Text Processing (1 test) - Targeted text manipulation
- Structured Data (5 tests) - JSON, Markdown, CSV generation
- Complex Operations (6 tests) - Multi-step reasoning and computation
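One way to represent such a frozen test bank is a small typed record per prompt; the two entries below are invented stand-ins, not the paper's actual test items:

from dataclasses import dataclass

@dataclass(frozen=True)
class DiagnosticTest:
    test_id: int
    category: str            # one of the five categories above
    prompt: str              # instruction sent verbatim to the model
    expected: str            # exact string the grader compares against
    case_sensitive: bool = True

TEST_BANK = [
    DiagnosticTest(2, "Constraint Compliance",
                   "Respond with the single word: BANANA", "BANANA"),
    DiagnosticTest(5, "String Manipulation",
                   "Reverse each word of 'hello world', then uppercase the result.",
                   "OLLEH DLROW"),
]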
Figures
The paper includes four publication-quality figures:
Figure 1: Performance Heatmap
Performance matrix showing top 50 models across all 20 tests (green=pass, red=fail). Reveals patterns in which models excel at which instruction types.
Figure 2: Provider Comparison
Average performance by provider (minimum 3 models). Shows x-ai leading at 79.3%, with substantial variation across providers (46 percentage point spread).
Figure 3: Test Difficulty
Pass rates for all 20 tests, color-coded by difficulty level. Highlights extreme variation from 2.7% (Test 5) to 96.1% (Tests 2 & 15).
Figure 4: Category Performance
Average pass rates across five test categories. String Manipulation (12.0%) proves most challenging, while Constraint Compliance (66.9%) is easiest.
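A Figure-1-style pass/fail heatmap can be rebuilt from the released results with a few lines of matplotlib; the snippet below uses placeholder data and labels rather than the paper's plotting scripts:

import matplotlib.pyplot as plt
import numpy as np

results = np.random.randint(0, 2, size=(50, 20))   # 1 = pass (green), 0 = fail (red)
fig, ax = plt.subplots(figsize=(10, 12))
ax.imshow(results, cmap="RdYlGn", aspect="auto", vmin=0, vmax=1)
ax.set_xticks(range(20))
ax.set_xticklabels([f"T{i + 1}" for i in range(20)])
ax.set_xlabel("Diagnostic test")
ax.set_ylabel("Model (top 50 by overall score)")
ax.set_title("Pass/fail matrix, top 50 models x 20 tests")
fig.tight_layout()
fig.savefig("performance_heatmap.pdf")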
Citation
If you use this work in your research, please cite:
@article{young2025instruction,
  title={When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs},
  author={Young, Richard J. and Gillins, Brandon and Matthews, Alice M.},
  journal={arXiv preprint arXiv:2510.18892},
  year={2025},
  url={http://arxiv.org/abs/2510.18892}
}
Resources
Paper & Data
- arXiv Paper: http://arxiv.org/abs/2510.18892
- Full Dataset: https://huggingface.co/datasets/richardyoung/llm-instruction-following-eval
- Complete evaluation results (5,120 evaluations)
- Excel workbook with multiple analysis sheets
- JSON export with metadata
- Publication-quality PDF figures
Code & Reproducibility
- Code Repository: https://huggingface.co/richardyoung/llm-instruction-following-code
- Complete evaluation framework (Python)
- All 20 diagnostic test prompts
- Analysis and visualization scripts
- Model configuration (256 verified models)
- Requirements and setup instructions
Paper Details
Structure
- 21 pages (excluding references)
- 4 figures (publication-quality PDFs)
- 5 tables (model rankings, category performance, provider comparisons)
- 27 bibliography entries (23 cited)
Sections
- Introduction - Motivation and research questions
- Related Work - Comprehensive review of instruction-following benchmarks
- Methodology - Evaluation framework, test design, model selection
- Results - Empirical findings across 256 models
- Discussion - Implications, failure modes, future work
- Conclusion - Summary and contributions
Related Work Coverage
Comprehensive citation of instruction-following research:
- Foundational: IFEval, Self-Instruct
- SOTA Benchmarks: InFoBench (DRFR), FollowBench, ComplexBench
- Specialized: SIFo, EifBench, CELLO, LLMBar, RewardBench
- Evaluation: Prometheus 2, PandaLM, Auto-J
- Data Generation: WizardLM/Evol-Instruct, Magpie, Instruction Backtranslation
- Domain Extensions: InfoSearch, FollowIR, InstructIR
- Efficiency: tinyBenchmarks, REIFE
Authors
Richard J. Young
Email: ryoung@unlv.edu
Role: Principal Investigator, lead author
Affiliation: University of Nevada, Las Vegas
Brandon Gillins
Email: bgillins@unlv.edu
Role: Co-author, technical lead
Affiliation: University of Nevada, Las Vegas
Alice M. Matthews
Email: amatthews@unlv.edu
Role: Co-author, data analysis
Affiliation: University of Nevada, Las Vegas
Impact & Contributions
Practical Contributions
- Diagnostic Tool - Compact 20-test suite for quick assessment
- Empirical Data - 5,120 evaluations across 256 models (one of the largest studies of its kind)
- Open Resources - Complete dataset, code, and evaluation framework
- Reproducibility - Frozen test bank, model snapshot, exact methodology
Research Contributions
- Failure Mode Identification - Consistent patterns in instruction-following failures
- Category Difficulty Analysis - String manipulation significantly harder than constraint compliance
- Provider Comparisons - Systematic differences across training methodologies
- Benchmark Design - Exact-match evaluation for objectivity and reproducibility
Community Value
- Researchers can quickly assess new models
- Practitioners can identify instruction-following strengths/weaknesses
- Developers can diagnose specific failure modes
- Community can extend with new diagnostic tests
Key Insights
1. Extreme Performance Variability
Even among recent models, instruction-following capabilities vary dramatically (0-100% pass rates). This suggests fundamental differences in training approaches and architectural choices.
2. Category-Specific Challenges
Models show highly uneven capabilities across instruction types. A model performing well on constraint compliance may fail completely on string manipulation.
3. The "Impossibly Hard Test"
Test 5 (Complex String Transformation) achieved only a 2.7% pass rate (7 of 256 models). Even GPT-4 and Claude-3 frequently fail it, suggesting certain instruction patterns remain extremely challenging.
4. Provider-Level Patterns
Providers differ substantially in average performance (33.3% to 79.3%), indicating that training methodologies significantly impact instruction adherence beyond model size or architecture.
5. Exact Match Feasibility
Despite challenges, 12 models achieved ≥85% pass rates, demonstrating that exact instruction following is possible with appropriate training and architecture.
License
This paper is released under CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International).
- Share and adapt with attribution
- Use for research and educational purposes
- Commercial use prohibited
- Share adaptations under the same license
Acknowledgments
We thank:
- OpenRouter for unified API access to 256+ models
- Model Providers (OpenAI, Anthropic, Google, Meta, Qwen, DeepSeek, x-ai, and others)
- Instruction-Following Research Community for foundational benchmarks
- University of Nevada, Las Vegas for research support
Paper Version: 1.0
Publication Date: October 18, 2025
arXiv ID: 2510.18892
Last Updated: October 23, 2025