alexcombessie (Alex Combessie)

upvoted an article 12 days ago

Article

Phare LLM benchmark V2: Reasoning models don't guarantee better security

12 days ago

•

9

liked a model 22 days ago

nvidia/Nemotron-Content-Safety-Reasoning-4B

Text Generation • 4B • Updated 23 days ago • 992 • 11

liked 2 models about 2 months ago

openai/gpt-oss-safeguard-20b

Text Generation • 22B • Updated Oct 29 • 36.7k • 169

openai/gpt-oss-safeguard-120b

Text Generation • 120B • Updated Oct 29 • 19.9k • 76

upvoted an article 3 months ago

Article

LLM vulnerability scanner for dynamic & multi-turn Red Teaming

Sep 25

•

2

liked a Space 4 months ago

PosterGen

🎨

2

Multi-Agent Academic Poster Generation

liked 2 models 5 months ago

openai/gpt-oss-120b

Text Generation • 120B • Updated Aug 26 • 3.78M • 4.29k

zai-org/GLM-4.5

Text Generation • 358B • Updated Aug 11 • 22.9k • 1.39k

updated a Space 5 months ago

README

🐢

liked a dataset 5 months ago

giskardai/realharm

Viewer • Updated Apr 16 • 136 • 152 • 12

commented on Good answers are not necessarily factual answers: an analysis of hallucination in leading LLMs 5 months ago

Hey @breckb, Alex here, co-founder of Giskard AI.

Apologies for the delayed response. I've just read your message and I agree with your suggestion. Our research team has released this dataset (a public set is available, the rest is kept private to prevent benchmark hacking by LLM makers) and this GitHub repo to reproduce it: https://github.com/Giskard-AI/phare

Hope it helps!

upvoted 2 articles 5 months ago

Article

Good answers are not necessarily factual answers: an analysis of hallucination in leading LLMs

May 7

•

42

Article

RealPerformance, A Dataset of Language Model Business Compliance Issues

Jul 21

•

4

upvoted an article 6 months ago

Article

LLMs recognise bias but also reproduce harmful stereotypes: an analysis of bias in leading LLMs

Jul 2

•

16

liked a Space 6 months ago

LLM Performance Leaderboard

🐨

429

View LLM performance rankings

upvoted a paper 7 months ago

Phare: A Safety Probe for Large Language Models

Paper • 2505.11365 • Published May 16 • 7

upvoted a paper 9 months ago

RealHarm: A Collection of Real-World Language Model Application Failures

Paper • 2504.10277 • Published Apr 14 • 10

liked a Space 9 months ago

MTEB Leaderboard

🥇

6.86k

Embedding Leaderboard

liked a dataset 9 months ago

giskardai/phare

Viewer • Updated 17 days ago • 4.05k • 686 • 12

liked a Space 9 months ago

Qwen2.5 Omni 7B Demo

🏆

364

Generate text and speech responses from text, audio, images, or video input
