https://huggingface.co/blog/onekq/adam-optimizer
If you already know the Adam(W) optimizer, feel free to skip this, and sorry for the wait. Otherwise, it should be a useful read.
I checked Codex and Claude, and they both use their own models to compress context.
For OCR, as you stated, a lot of training would need to be done. It's an ecosystem problem.
Good stuff! I didn't consider token cost at all.
I'm thinking about an open source project for a context compressor (algorithmic, at most a small on-premise model) for agent builders. Does this make sense? If so, what should it look like? A rough sketch of the algorithmic end is below.
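To make the idea concrete, here is a minimal sketch of what the purely algorithmic end could look like: no model, just heuristics over the message list. Everything here (`Message`, `compress_context`, the 4-chars-per-token estimate, the truncation thresholds) is hypothetical illustration, not an existing API.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def compress_context(messages: list[Message], budget: int) -> list[Message]:
    """Fit a conversation into a token budget by (1) truncating long
    tool outputs and (2) evicting the oldest middle turns, while always
    keeping the system prompt and the most recent exchange."""
    # Step 1: truncate tool outputs, which tend to dominate agent contexts.
    trimmed = []
    for m in messages:
        if m.role == "tool" and estimate_tokens(m.content) > 200:
            m = Message(m.role, m.content[:800] + "\n...[truncated]")
        trimmed.append(m)

    # Step 2: drop the oldest non-system turns until the budget fits,
    # always preserving the last two turns.
    head = [m for m in trimmed if m.role == "system"]
    body = [m for m in trimmed if m.role != "system"]
    while body[:-2] and sum(estimate_tokens(m.content) for m in head + body) > budget:
        body.pop(0)
    return head + body
```

A small model could later replace step 1 with summarization instead of blunt truncation, but the interface (messages in, cheaper messages out) would stay the same.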
Do you know of any work that has studied how agents use context?
That is the case.
I'm a developer at heart. As a developer, you spend the majority of your time running things and hopping between environments, e.g. the IDE, the cloud, GitHub. These environments all happen to have full-featured bash support, a perfect sandbox for the CLI form factor.
The paradigm change AI brought to the developer world is nothing short of meteoric, but it is also an exception. Lots of efforts are trying to generalize this momentum to the next area(s). I won't bet on them.