Apologies for the delayed response. I've just read your message and I agree with your suggestion. Our research team has released this dataset (a public set is available; the rest is kept private to prevent benchmark hacking by LLM makers), along with this GitHub repo to reproduce it: https://github.com/Giskard-AI/phare