AI & ML interests

None defined yet.

Recent Activity

codelion posted an update about 18 hours ago
MARS Achieves Strong Results on Google DeepMind's IMO-Bench

We evaluated OptiLLM's MARS (Multi-Agent Reasoning System) approach on IMO-Bench, Google DeepMind's challenging mathematical reasoning benchmark with International Mathematical Olympiad-level problems.

What is MARS?

MARS is a multi-agent reasoning technique that works with any LLM. It uses 3 parallel reasoning agents that independently solve problems, then verifies their solutions through consensus and iterative refinement. The key advantage: it's model-agnostic and can be applied to any base model through OptiLLM's inference proxy.
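
As a rough illustration of the pattern (not OptiLLM's actual implementation, which also verifies solutions and iteratively refines them), the core idea looks something like the sketch below; call_model is a placeholder for any LLM call, e.g. routed through OptiLLM's proxy:

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    # Placeholder: send the prompt to any base LLM and return its text response.
    raise NotImplementedError

def mars_style_answer(problem: str, n_agents: int = 3) -> str:
    prompt = f"Solve the problem and end with 'ANSWER: <value>'.\n\n{problem}"
    # Several agents attempt the problem independently, in parallel
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        attempts = list(pool.map(call_model, [prompt] * n_agents))
    answers = [a.split("ANSWER:")[-1].strip() for a in attempts]
    # Consensus step: keep the answer the agents agree on most often
    return Counter(answers).most_common(1)[0][0]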

Results on IMO-Bench:

AnswerBench (400 short-answer problems):
MARS: 36.0% (144/400 correct)
Baseline: 24.5% (98/400 correct)
Improvement: +11.5pp across all domains

Category breakdown:
- Algebra: 33% (vs 21% baseline)
- Combinatorics: 26% (vs 19% baseline)
- Geometry: 43% (vs 28% baseline)
- Number Theory: 42% (vs 30% baseline)

ProofBench (60 proof construction problems):
MARS: 26.7% (16/60 correct)
Baseline: 18.3% (11/60 correct)
Improvement: +8.4pp

Category breakdown:
- Number Theory: 42.9% (vs 14.3% baseline)
- Combinatorics: 37.5% (vs 31.2% baseline)
- Algebra: 18.8% (vs 25.0% baseline)
- Geometry: 7.1% (vs 0.0% baseline)

All results achieved using google/gemini-2.5-flash-lite-preview-09-2025 as the base model. The same MARS approach can enhance reasoning for any model through OptiLLM's OpenAI-compatible API.

Datasets available at:
AnswerBench: huggingface.co/datasets/Hwilner/imo-answerbench
ProofBench: huggingface.co/datasets/Hwilner/imo-proofbench

Try it yourself:

python optillm.py --approach mars --model google/gemini-2.5-flash-lite-preview-09-2025

Or via API with approach prefix:

model: "mars-google/gemini-2.5-flash-lite-preview-09-2025"

Full evaluation code and results available at: github.com/algorithmicsuperintelligence/optillm
codelion posted an update 4 days ago
On this day in 2019, OpenAI released the final GPT-2 model as part of their staged release. I still remember that November well - so much was happening, but GPT-2's release felt like a watershed moment for the field. It showed us what was possible with carefully trained language models.

To recreate some of that GPT-2 magic, I recently tackled an interesting challenge: can you pretrain a language model with just 1 billion tokens - roughly 1/10th of what GPT-2 used - and still get comparable performance? After 50+ systematic experiments testing different dataset mixtures, the answer is yes.

The result is codelion/gpt-2-70m, which achieves over 90% of GPT-2's benchmark performance despite being trained on 10x less data. The key was finding the optimal dataset composition: 50% high-quality textbook PDFs, 30% filtered web content, and 20% educational resources. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).

If you're interested in the full story of how we discovered this optimal mixture and why curriculum learning catastrophically failed, check out the complete article: https://huggingface.co/blog/codelion/optimal-dataset-mixing

Sometimes less really is more - when you mix it right.
codelion posted an update 5 days ago
The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix

We trained a GPT-2 model to 90%+ performance using just 1/10th the training data through 50+ systematic experiments on dataset mixing strategies.

Key Finding:

A static mix of 50% finePDFs + 30% DCLM-baseline + 20% FineWeb-Edu consistently outperforms complex curriculum learning approaches. Static mixing is simpler, faster, and avoids catastrophic failures from hard distribution shifts.
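
As a sketch, this kind of static mix can be assembled with the datasets library's interleave_datasets; the dataset IDs, configs, and splits below are illustrative rather than the exact ones used in the experiments:

from datasets import load_dataset, interleave_datasets

# Stream the three sources and sample from them at fixed ratios
finepdfs = load_dataset("HuggingFaceFW/finepdfs", split="train", streaming=True)
dclm = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

mixed = interleave_datasets(
    [finepdfs, dclm, fineweb_edu],
    probabilities=[0.5, 0.3, 0.2],  # 50% finePDFs, 30% DCLM-baseline, 20% FineWeb-Edu
    seed=42,
)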

Results:

Our GPT-2-70M model (70M parameters, 1B tokens) scores 38.15% on benchmarks vs GPT-2's 39.13% - only 0.98 points behind despite 10x less data and 44% fewer parameters. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).

The takeaway: careful dataset curation matters more than total data volume.

Model: codelion/gpt-2-70m

Datasets: https://huggingface.co/collections/codelion/pre-training-dataset-samples

Full blog: https://huggingface.co/blog/codelion/optimal-dataset-mixing
ZennyKenny posted an update 7 days ago
Anyone got the scoop on a good OCR model that's available on inference?

Keen to make use of an endpoint (gated or not -- happy to pay for usage) for a personal project, but not so keen to pay for the GPU hosting myself.

🙈🙈🙈
ZennyKenny posted an update 17 days ago
codelion posted an update 19 days ago
🧠 Introducing Ellora Recipe #6: Execution-Aware World Model for Qwen3-4B-Thinking

Teaching LLMs to understand not just what code does, but HOW it executes at runtime!

Inspired by Meta's CWM (Code World Model) research, this LoRA adapter adds execution awareness to Qwen3-4B-Thinking-2507. The model learns to predict variable states, trace program execution step-by-step, and debug code by understanding runtime behavior.

🔍 Key Innovation:
We combine Qwen3's native thinking capabilities with real Python execution traces captured via sys.settrace(). The model is trained using GRPO with a custom reward function that scores execution prediction accuracy.
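
As a toy illustration of the tracing side (not the project's actual tracer), sys.settrace can record which line is about to run and the local variables visible at that point:

import sys

def trace_locals(func, *args):
    # Collect (line number, snapshot of locals) each time a line of func is about to execute
    states = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            states.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, states

def example(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, trace = trace_locals(example, 3)
for lineno, local_vars in trace:
    print(lineno, local_vars)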

📊 Training Approach:
- Hybrid Magpie-style code generation
- Real execution tracing for ground truth
- Self-supervised learning (no manual annotations!)
- 298 training samples with execution traces

✨ What it does:
- Predicts variable states at each line of code
- Explains execution flow with thinking tags
- Helps debug by understanding runtime behavior
- Works as a "neural debugger"

🎯 Results:
- 20% overall accuracy on execution prediction
- 33.3% mean state accuracy
- Trained on Qwen3-4B-Thinking (262K context, 4B params)

🔗 Links:
Model: codelion/Qwen3-4B-execution-world-model-lora
Dataset: codelion/execution-world-model-dataset
GitHub Recipe: https://github.com/codelion/ellora
Notebook: https://github.com/codelion/ellora/blob/main/Ellora_Recipe_6_Execution_World_Model_Thinking_LoRA.ipynb

Part of the Ellora project - standardized LoRA recipes for enhancing LLM capabilities. All recipes use self-supervised data generation and work with existing infrastructure (PEFT, LoRAX, vLLM).

#LLM #LoRA #CodeGeneration #WorldModel #Qwen #AI #MachineLearning
ZennyKenny posted an update 24 days ago
Did Hugging Face just ban hammer a bunch of bot accounts or am I just so uninteresting that 30% of my subs dropped me overnight?

😬 Wait, don't answer that.
ZennyKenny posted an update 26 days ago
codelion posted an update about 1 month ago
🚀 Adaptive Classifier v0.1.0: Now with ONNX Runtime Support!

We're excited to announce a major update to Adaptive Classifier - a flexible, continuous learning classification system that adapts to new classes without retraining!

What's New:

⚡ ONNX Runtime Integration: Get 1.14x faster CPU inference out of the box (up to 4x on x86 processors)

📦 INT8 Quantization: Models are now 4x smaller with minimal accuracy loss, making deployment easier and faster

🎯 Smart Loading: Automatically uses the best model variant for your hardware - quantized for speed by default, or unquantized for maximum accuracy

🔄 7.5x Faster Model Loading: Get started quickly with optimized model initialization

How It Works:

Adaptive Classifier lets you build text classifiers that continuously learn from new examples without catastrophic forgetting. Perfect for:
- Dynamic classification tasks where classes evolve over time
- Few-shot learning scenarios with limited training data
- Production systems that need to adapt to new categories

The new ONNX support means you get production-ready speed on CPU without any code changes - just load and run!

Try it now:

from adaptive_classifier import AdaptiveClassifier

# Load with ONNX automatically enabled (quantized for best performance)
classifier = AdaptiveClassifier.load("adaptive-classifier/llm-router")

# Add examples dynamically
classifier.add_examples(
["Route this to GPT-4", "Simple task for GPT-3.5"],
["strong", "weak"]
)

# Predict with optimized inference
predictions = classifier.predict("Complex reasoning task")

Check out our LLM Router model to see it in action:
adaptive-classifier/llm-router

GitHub Repository:
https://github.com/codelion/adaptive-classifier

Install now: pip install adaptive-classifier

We'd love to hear your feedback and see what you build with it!

#MachineLearning #NLP #ONNX #ContinuousLearning #TextClassification
ZennyKenny posted an update about 1 month ago
🥊 Big Code Arena is live! bigcode/arena

💡 bigcode is an open scientific collaboration working on responsible training of large language models for coding applications.

👉 The Arena ranks LLMs on their ability to handle natural-language vibe coding requests in a head-to-head format, judged by feedback from human reviewers.

🧠 It was a pleasure to contribute to this project led by @terryyz and appear as an additional contributor in the Big Code Arena paper.
ZennyKenny posted an update about 1 month ago
🖤 Probably one of my favorite projects that I've worked on so far, introducing Новояз (Novoyaz).

🛠 One of the first acts of the Bolshevik government after the Russian Revolution was the reform and standardization of the Russian language, which at the time had a non-standard and challenging orthography.

📚 Following the reform, the government launched a nationwide campaign called Ликбез (Likbez), which sought to improve literacy in the country (by the way, it worked, bringing the national literacy rate from <20% in the 1920s to >80% by the 1930s).

‼️ While this is a remarkable result that should absolutely be celebrated, it has left behind literally hundreds of thousands, if not millions, of artifacts using pre-reform Russian orthography.

😓 Researchers and historians are working tirelessly to translate these artifacts into modern Russian so that they can be archived and studied, but many have told me that they are doing this BY HAND (!).

💡 I thought, well, this is a perfect use case for OCR and a fine-tuned LLM to step in and aid in this important work!

🌍 Introducing НОВОЯЗ (NOVOYAZ)! Powered by ChatDOC/OCRFlux-3B and ZennyKenny/oss-20b-prereform-to-modern-ru-merged, researchers can now convert images of their pre-reform documents to modern Russian orthography using the power of open-source AI!

Check it out and drop a like to support more real-world use cases for open source AI outside of traditional tech-centric domains!

ZennyKenny/Novoyaz
ZennyKenny posted an update about 1 month ago
🔒 Like a lot of other AI builders, I have some anxiety about the surveillance-capitalist paradigm emerging in the AI space.

👉 Of course, this kind of thing isn't completely new and has been going on for decades, but the difference now is the deeper immersion of AI tools into our daily lives (compared to something like a search engine or social network).

โ• That's why I was really excited to come across Lumo: https://lumo.proton.me/u/1/

โ• Lumo is created by ProtonPrivacy and offers privacy-first features that make sure that what you do with you AI assistant is your business.

โ• I already trust Proton with my other business apps and I've never been disappointed, plus the Lumo architecture is really fantastic, dynamically routing each query to the most appropriate model for the request.

🔥 Really awesome stuff Proton, thank you as always.
ZennyKenny posted an update about 2 months ago
The reactions to mostlyai/synthetic-sdk-demo have been incredible! 🔥

Some users wrote that they were having performance issues on larger datasets, so I've capped the Space's input at 5,000 rows and 10 columns. You can always use the open-source SDK that powers the Space on datasets of arbitrary size and shape!
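
For reference, local usage of the SDK looks roughly like this (method names follow my reading of the project's README; treat the exact calls as assumptions and check the repo for the current API):

import pandas as pd
from mostlyai.sdk import MostlyAI

original = pd.read_csv("my_table.csv")               # your real tabular data, any size

mostly = MostlyAI(local=True)                        # run fully locally, no cloud account needed
generator = mostly.train(data=original)              # fit a generator on the original data
synthetic = mostly.generate(generator, size=10_000)  # sample a privacy-safe synthetic dataset
synthetic_df = synthetic.data()                      # back to a pandas DataFrame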

Check it out: https://github.com/mostly-ai/mostlyai 👈
Tonic posted an update about 2 months ago
codelion posted an update about 2 months ago
🚀 Adaptive Classifier v0.0.17 Released - Major Accuracy Improvements!

We've just released a major update fixing critical bugs that were causing 40-50% accuracy drops in our enterprise classifiers!

Key Fixes:
• Fixed k-parameter prediction bug causing massive accuracy loss
• Improved incremental learning for new classes
• Enhanced weight preservation during model updates

Dramatic Results:
• fraud-detection: 43.9% → 92.7% (+48.8%) adaptive-classifier/fraud-detection
• business-sentiment: 88.9% → 98.8% (+9.9%) adaptive-classifier/business-sentiment
• expense-category: 26.7% → 84.2% (+57.5%) adaptive-classifier/expense-category
• language-detection: 98.8% → 100% (+1.2%) adaptive-classifier/language-detection

15/17 enterprise classifiers now maintain ≤5% accuracy difference from original performance!

Other High-Performing Models:
• email-security (93.8%): adaptive-classifier/email-security
• content-moderation (100%): adaptive-classifier/content-moderation
• pii-detection (100%): adaptive-classifier/pii-detection

Quick Start:

from adaptive_classifier import AdaptiveClassifier

classifier = AdaptiveClassifier.load("adaptive-classifier/fraud-detection")
predictions = classifier.predict("Suspicious transaction pattern", k=3)  # k = number of top predictions to return

Install: pip install --upgrade adaptive-classifier==0.0.17

All models: adaptive-classifier

🎯 Production-ready continuous learning for enterprise text classification!

#MachineLearning #TextClassification #ContinualLearning #EnterpriseAI
Tonic posted an update about 2 months ago
COMPUTER CONTROL IS ON-DEVICE!

๐Ÿก๐Ÿค– 78 % of EU smart-home owners DONโ€™T trust cloud voice assistants.

So we killed the cloud.

Meet Exté: a palm-sized Android device that sees, hears & speaks your language - 100% offline, 0% data sent anywhere.

🔓 We submitted our technologies for consideration to the Liquid AI hackathon.

📊 Dataset: 79k UI-action pairs on Hugging Face (largest Android-control corpus ever) Tonic/android-operator-episodes

⚡ Model: 98% task accuracy, 678MB compressed, fits on existing Android devices! Tonic/l-android-control

🛤️ Experiment Tracker: check out the training on our TrackioApp Tonic/l-android-control

🎮 Live Model Demo: upload an Android screenshot and instructions to see the model in action! Tonic/l-operator-demo



Built in a garage, funded by pre-orders, no VC. Now we're scaling to 1k installer units.

We're giving 50 limited-edition prototypes to investors, installers & researchers who want to co-design the sovereign smart home.

👇 Drop "EUSKERA" in the comments if you want an invite, tag a friend who still thinks Alexa is "convenient," and smash ♥️ if AI should belong to people - not servers.
ZennyKenny posted an update about 2 months ago
The open source Synthetic Data SDK from MOSTLY AI (mostlyai) offers the ability to generate realistic, privacy-safe synthetic data with just a few lines of Python.

Try it out yourself in a No Code UI in the SDK Demo Space: mostlyai/synthetic-sdk-demo