Add model-index with benchmark evaluations

#20 opened by davidlms

Added structured evaluation results from benchmark research (a sketch of the resulting metadata appears below):

  • HLE (Humanity's Last Exam): 37.1
  • FRAMES: 76.3
  • τ²-Bench: 80.2

These benchmarks evaluate the model's performance on:

  • General knowledge and reasoning (HLE)
  • Factuality and retrieval accuracy in RAG systems (FRAMES)
  • Conversational agent capabilities in dual-control environments (τ²-Bench)

Source: https://github.com/NVlabs/ToolOrchestra
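
For context, here is a minimal sketch of what the added model-index metadata might look like, assuming a text-generation task type and accuracy-style metrics. The model name and dataset `type` slugs are placeholders; the actual diff in this PR is authoritative.

```yaml
# Hypothetical sketch of the model-index entry added by this PR.
# The model name and dataset type slugs below are placeholders, and
# "accuracy" is an assumed metric type -- consult the diff for the real values.
model-index:
  - name: ToolOrchestra            # placeholder model name
    results:
      - task:
          type: text-generation    # assumed task type
        dataset:
          name: Humanity's Last Exam
          type: hle                # assumed slug
        metrics:
          - type: accuracy
            value: 37.1
            name: HLE
        source:
          url: https://github.com/NVlabs/ToolOrchestra
      - task:
          type: text-generation
        dataset:
          name: FRAMES
          type: frames             # assumed slug
        metrics:
          - type: accuracy
            value: 76.3
            name: FRAMES
        source:
          url: https://github.com/NVlabs/ToolOrchestra
      - task:
          type: text-generation
        dataset:
          name: τ²-Bench
          type: tau2-bench         # assumed slug
        metrics:
          - type: accuracy
            value: 80.2
            name: τ²-Bench
        source:
          url: https://github.com/NVlabs/ToolOrchestra
```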

This enables the model to appear on leaderboards and makes it easier to compare it with other models.
