TRUEBench: Explore and compare language model performance across categories and languages