With the arrival of Twinkle April, the open-source celebration Twinkle AI holds every April, our community is excited to unveil its very first project: Twinkle Eval.
Unlike traditional evaluation tools such as iKala's ievals (https://github.com/ikala-ai/ievals), which can only evaluate language models (LMs) one sample at a time, Twinkle Eval is designed with Large Reasoning Models (LRMs) in mind. As reasoning time grows with more complex models, sequential tools become increasingly inefficient: evaluating an LRM on the ikala/tmmluplus benchmark, for example, could run for half a day and still not finish.
One question we were especially curious about: does shuffling the order of multiple-choice answers affect model accuracy? 🤔 See "Changing Answer Order Can Decrease MMLU Accuracy" (arXiv:2406.19470v1).
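For intuition, shuffling here means permuting the A/B/C/D options and remapping the gold label, so only the presentation order changes while the correct answer text stays the same. The snippet below is a generic illustration of that operation, not Twinkle Eval's or the paper's actual code:

```python
import random
import string

def shuffle_choices(question, choices, answer_idx, seed=None):
    """Permute the options and return (prompt, new index of the gold answer)."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)  # order[pos] = original index shown at position pos
    lines = [question] + [
        f"{string.ascii_uppercase[pos]}. {choices[orig]}"
        for pos, orig in enumerate(order)
    ]
    return "\n".join(lines), order.index(answer_idx)

# The gold answer moves with its option text, so a robust model's pick should not change.
prompt, gold = shuffle_choices("2 + 2 = ?", ["3", "4", "5", "6"], answer_idx=1, seed=0)
```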
To address these challenges, Twinkle Eval brings three key innovations to the table:
1️⃣ Parallelized evaluation of samples
2️⃣ Multi-round testing for stability
3️⃣ Randomized answer order to test robustness
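Conceptually, the three ideas combine into something like the sketch below: a thread pool fans requests out in parallel, each sample gets its own shuffle seed per round, and accuracy is averaged across rounds. This is a simplified, hypothetical sketch rather than Twinkle Eval's actual implementation; `ask` stands in for whatever model API call you use, and the `{"question", "choices", "answer_idx"}` sample format is assumed.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

LABELS = "ABCD"

def evaluate_sample(ask: Callable[[str], str], sample: dict, seed: int) -> bool:
    """Shuffle the options, query the model once, and score against the remapped gold label."""
    rng = random.Random(seed)                        # per-sample seed keeps runs reproducible
    order = list(range(len(sample["choices"])))
    rng.shuffle(order)                               # 3) randomized answer order
    prompt = sample["question"] + "\n" + "\n".join(
        f"{LABELS[pos]}. {sample['choices'][orig]}" for pos, orig in enumerate(order)
    )
    prediction = ask(prompt).strip().upper()[:1]
    return prediction == LABELS[order.index(sample["answer_idx"])]

def run_eval(ask: Callable[[str], str], dataset: list[dict],
             rounds: int = 3, workers: int = 16) -> float:
    """Average accuracy over several rounds, evaluating samples concurrently within each round."""
    round_scores = []
    for r in range(rounds):                                      # 2) multi-round testing
        with ThreadPoolExecutor(max_workers=workers) as pool:    # 1) parallel evaluation
            futures = [pool.submit(evaluate_sample, ask, s, r * len(dataset) + i)
                       for i, s in enumerate(dataset)]
            results = [f.result() for f in futures]
        round_scores.append(sum(results) / len(results))
    return statistics.mean(round_scores)
```

Because each sample is an I/O-bound API call, even a plain thread pool (or an async client) recovers large wall-clock speedups without touching the model itself.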
After running experiments, we observed that Twinkle Eval can speed up evaluation by up to 15×. Interestingly, most models scored slightly lower under the 2️⃣ and 3️⃣ settings (multi-round, randomized answer order) than their claimed performance, suggesting further benchmarking is needed.
This framework also comes with additional tunable parameters and detailed per-question logging of LM behavior, perfect for those who want to dive deeper.
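As a purely hypothetical illustration (the fields and format are assumptions, not Twinkle Eval's actual schema), writing one JSON line per sample per round makes it easy to audit exactly where a model gains or loses accuracy:

```python
import json

# Hypothetical per-question record, for illustration only.
record = {
    "round": 1,
    "sample_id": "sample-0017",
    "shuffled_order": [2, 0, 3, 1],   # permutation applied to the original choices
    "gold": "C",
    "prediction": "B",
    "correct": False,
    "latency_s": 4.82,
}
with open("eval_log.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```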
If you find Twinkle Eval useful, please ⭐ the project and help spread the word!