Add model-index with benchmark evaluations
#20 opened by davidlms
Added structured evaluation results from published benchmark reports:
- HLE (Humanity's Last Exam): 37.1
- FRAMES: 76.3
- τ²-Bench: 80.2
These benchmarks evaluate the model's performance on:
- General knowledge and reasoning (HLE)
- Factuality and retrieval accuracy in RAG systems (FRAMES)
- Conversational agent capabilities in dual-control environments (τ²-Bench)
Source: https://github.com/NVlabs/ToolOrchestra
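For context, here is a minimal sketch of what the added `model-index` block in the model card's YAML front matter could look like. The model name, task types, dataset `type` identifiers, and metric type below are illustrative placeholders, not the exact values in this PR; only the scores and the source URL come from above:

```yaml
model-index:
- name: your-model-name  # hypothetical placeholder; use the repo's actual model name
  results:
  - task:
      type: question-answering  # assumed task type for HLE
    dataset:
      name: Humanity's Last Exam (HLE)
      type: hle  # assumed dataset identifier
    metrics:
    - type: accuracy  # assumed metric type for the reported score
      value: 37.1
    source:
      name: ToolOrchestra benchmark results
      url: https://github.com/NVlabs/ToolOrchestra
  - task:
      type: question-answering  # assumed task type for FRAMES
    dataset:
      name: FRAMES
      type: frames  # assumed dataset identifier
    metrics:
    - type: accuracy
      value: 76.3
    # source omitted for brevity; same URL as above
  - task:
      type: text-generation  # assumed task type for an agentic benchmark
    dataset:
      name: τ²-Bench
      type: tau2-bench  # assumed dataset identifier
    metrics:
    - type: accuracy
      value: 80.2
    # source omitted for brevity; same URL as above
```

The Hub parses this block from the YAML front matter at the top of README.md and renders the metrics on the model page.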
This enables the model to appear on leaderboards and makes it easier to compare against other models.