AI & ML interests

None defined yet.

Recent Activity

codelion posted an update about 18 hours ago
MARS Achieves Strong Results on Google DeepMind's IMO-Bench

We evaluated OptiLLM's MARS (Multi-Agent Reasoning System) approach on IMO-Bench, Google DeepMind's challenging mathematical reasoning benchmark with International Mathematical Olympiad-level problems.

What is MARS?

MARS is a multi-agent reasoning technique that works with any LLM. It uses 3 parallel reasoning agents that independently solve problems, then verifies their solutions through consensus and iterative refinement. The key advantage: it's model-agnostic and can be applied to any base model through OptiLLM's inference proxy.
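
As a rough illustration of the pattern (not OptiLLM's actual implementation, which also verifies solutions and iteratively refines them), the core idea looks something like the sketch below; call_model is a placeholder for any LLM call, e.g. routed through OptiLLM's proxy:

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    # Placeholder: send the prompt to any base LLM and return its text response.
    raise NotImplementedError

def mars_style_answer(problem: str, n_agents: int = 3) -> str:
    prompt = f"Solve the problem and end with 'ANSWER: <value>'.\n\n{problem}"
    # Several agents attempt the problem independently, in parallel
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        attempts = list(pool.map(call_model, [prompt] * n_agents))
    answers = [a.split("ANSWER:")[-1].strip() for a in attempts]
    # Consensus step: keep the answer the agents agree on most often
    return Counter(answers).most_common(1)[0][0]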

Results on IMO-Bench:

AnswerBench (400 short-answer problems):
MARS: 36.0% (144/400 correct)
Baseline: 24.5% (98/400 correct)
Improvement: +11.5pp across all domains

Category breakdown:
- Algebra: 33% (vs 21% baseline)
- Combinatorics: 26% (vs 19% baseline)
- Geometry: 43% (vs 28% baseline)
- Number Theory: 42% (vs 30% baseline)

ProofBench (60 proof construction problems):
MARS: 26.7% (16/60 correct)
Baseline: 18.3% (11/60 correct)
Improvement: +8.4pp

Category breakdown:
- Number Theory: 42.9% (vs 14.3% baseline)
- Combinatorics: 37.5% (vs 31.2% baseline)
- Algebra: 18.8% (vs 25.0% baseline)
- Geometry: 7.1% (vs 0.0% baseline)

All results achieved using google/gemini-2.5-flash-lite-preview-09-2025 as the base model. The same MARS approach can enhance reasoning for any model through OptiLLM's OpenAI-compatible API.

Datasets available at:
AnswerBench: huggingface.co/datasets/Hwilner/imo-answerbench
ProofBench: huggingface.co/datasets/Hwilner/imo-proofbench

Try it yourself:

python optillm.py --approach mars --model google/gemini-2.5-flash-lite-preview-09-2025

Or via API with approach prefix:

model: "mars-google/gemini-2.5-flash-lite-preview-09-2025"

Full evaluation code and results available at: github.com/algorithmicsuperintelligence/optillm
codelion posted an update 4 days ago
On this day in 2019, OpenAI released the final GPT-2 model as part of their staged release. I still remember that November well - so much was happening, but GPT-2's release felt like a watershed moment for the field. It showed us what was possible with carefully trained language models.

To recreate some of that GPT-2 magic, I recently tackled an interesting challenge: can you pretrain a language model with just 1 billion tokens - roughly 1/10th of what GPT-2 used - and still get comparable performance? After 50+ systematic experiments testing different dataset mixtures, the answer is yes.

The result is codelion/gpt-2-70m, which achieves over 90% of GPT-2's benchmark performance despite being trained on 10x less data. The key was finding the optimal dataset composition: 50% high-quality textbook PDFs, 30% filtered web content, and 20% educational resources. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).

If you're interested in the full story of how we discovered this optimal mixture and why curriculum learning catastrophically failed, check out the complete article: https://huggingface.co/blog/codelion/optimal-dataset-mixing

Sometimes less really is more - when you mix it right.
codelion posted an update 5 days ago
The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix

We trained a GPT-2 model to 90%+ performance using just 1/10th the training data through 50+ systematic experiments on dataset mixing strategies.

Key Finding:

A static mix of 50% finePDFs + 30% DCLM-baseline + 20% FineWeb-Edu consistently outperforms complex curriculum learning approaches. Static mixing is simpler, faster, and avoids catastrophic failures from hard distribution shifts.
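
As a sketch, this kind of static mix can be assembled with the datasets library's interleave_datasets; the dataset IDs, configs, and splits below are illustrative rather than the exact ones used in the experiments:

from datasets import load_dataset, interleave_datasets

# Stream the three sources and sample from them at fixed ratios
finepdfs = load_dataset("HuggingFaceFW/finepdfs", split="train", streaming=True)
dclm = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

mixed = interleave_datasets(
    [finepdfs, dclm, fineweb_edu],
    probabilities=[0.5, 0.3, 0.2],  # 50% finePDFs, 30% DCLM-baseline, 20% FineWeb-Edu
    seed=42,
)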

Results:

Our GPT-2-70M model (70M parameters, 1B tokens) scores 38.15% on benchmarks vs GPT-2's 39.13% - only 0.98 points behind despite 10x less data and 44% fewer parameters. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).

The takeaway: careful dataset curation matters more than total data volume.

Model: codelion/gpt-2-70m

Datasets: https://huggingface.co/collections/codelion/pre-training-dataset-samples

Full blog: https://huggingface.co/blog/codelion/optimal-dataset-mixing
ZennyKenny posted an update 7 days ago
Anyone got the scoop on a good OCR model that's available on inference?

Keen to make use of an endpoint (gated or not -- happy to pay for usage) for a personal project, but not so keen to pay for the GPU hosting myself.

🙈🙈🙈
ZennyKenny posted an update 17 days ago
codelion posted an update 19 days ago
🧠 Introducing Ellora Recipe #6: Execution-Aware World Model for Qwen3-4B-Thinking

Teaching LLMs to understand not just what code does, but HOW it executes at runtime!

Inspired by Meta's CWM (Code World Model) research, this LoRA adapter adds execution awareness to Qwen3-4B-Thinking-2507. The model learns to predict variable states, trace program execution step-by-step, and debug code by understanding runtime behavior.

🔍 Key Innovation:
We combine Qwen3's native thinking capabilities with real Python execution traces captured via sys.settrace(). The model is trained using GRPO with a custom reward function that scores execution prediction accuracy.
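
As a toy illustration of the tracing side (not the project's actual tracer), sys.settrace can record which line is about to run and the local variables visible at that point:

import sys

def trace_locals(func, *args):
    # Collect (line number, snapshot of locals) each time a line of func is about to execute
    states = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            states.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, states

def example(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, trace = trace_locals(example, 3)
for lineno, local_vars in trace:
    print(lineno, local_vars)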

📊 Training Approach:
- Hybrid Magpie-style code generation
- Real execution tracing for ground truth
- Self-supervised learning (no manual annotations!)
- 298 training samples with execution traces

✨ What it does:
- Predicts variable states at each line of code
- Explains execution flow with thinking tags
- Helps debug by understanding runtime behavior
- Works as a "neural debugger"

🎯 Results:
- 20% overall accuracy on execution prediction
- 33.3% mean state accuracy
- Trained on Qwen3-4B-Thinking (262K context, 4B params)

🔗 Links:
Model: codelion/Qwen3-4B-execution-world-model-lora
Dataset: codelion/execution-world-model-dataset
GitHub Recipe: https://github.com/codelion/ellora
Notebook: https://github.com/codelion/ellora/blob/main/Ellora_Recipe_6_Execution_World_Model_Thinking_LoRA.ipynb

Part of the Ellora project - standardized LoRA recipes for enhancing LLM capabilities. All recipes use self-supervised data generation and work with existing infrastructure (PEFT, LoRAX, vLLM).

#LLM #LoRA #CodeGeneration #WorldModel #Qwen #AI #MachineLearning
ZennyKenny posted an update 24 days ago
Did Hugging Face just ban hammer a bunch of bot accounts or am I just so uninteresting that 30% of my subs dropped me overnight?

😬 Wait, don't answer that.
ZennyKenny posted an update 26 days ago
codelion posted an update about 1 month ago
🚀 Adaptive Classifier v0.1.0: Now with ONNX Runtime Support!

We're excited to announce a major update to Adaptive Classifier - a flexible, continuous learning classification system that adapts to new classes without retraining!

What's New:

⚡ ONNX Runtime Integration: Get 1.14x faster CPU inference out of the box (up to 4x on x86 processors)

📦 INT8 Quantization: Models are now 4x smaller with minimal accuracy loss, making deployment easier and faster

🎯 Smart Loading: Automatically uses the best model variant for your hardware - quantized for speed by default, or unquantized for maximum accuracy

🔄 7.5x Faster Model Loading: Get started quickly with optimized model initialization

How It Works:

Adaptive Classifier lets you build text classifiers that continuously learn from new examples without catastrophic forgetting. Perfect for:
- Dynamic classification tasks where classes evolve over time
- Few-shot learning scenarios with limited training data
- Production systems that need to adapt to new categories

The new ONNX support means you get production-ready speed on CPU without any code changes - just load and run!

Try it now:

from adaptive_classifier import AdaptiveClassifier

# Load with ONNX automatically enabled (quantized for best performance)
classifier = AdaptiveClassifier.load("adaptive-classifier/llm-router")

# Add examples dynamically
classifier.add_examples(
["Route this to GPT-4", "Simple task for GPT-3.5"],
["strong", "weak"]
)

# Predict with optimized inference
predictions = classifier.predict("Complex reasoning task")

Check out our LLM Router model to see it in action:
adaptive-classifier/llm-router

GitHub Repository:
https://github.com/codelion/adaptive-classifier

Install now: pip install adaptive-classifier

We'd love to hear your feedback and see what you build with it!

#MachineLearning #NLP #ONNX #ContinuousLearning #TextClassification
ZennyKenny posted an update about 1 month ago
🥊 Big Code Arena is live! bigcode/arena

💡 bigcode is an open scientific collaboration working on responsible training of large language models for coding applications.

👉 The Arena ranks LLMs on their ability to handle natural-language vibe coding requests in a head-to-head format, judged by feedback from human reviewers.

🧠 It was a pleasure to contribute to this project led by @terryyz and appear as an additional contributor in the Big Code Arena paper.
ZennyKenny posted an update about 1 month ago
🖤 Probably one of my favorite projects that I've worked on so far, introducing Новояз (Novoyaz).

🛠 One of the first acts of the Bolshevik government after the Russian Revolution was the reform and standardization of the Russian language, which at the time had a non-standard and challenging orthography.

📚 Following the reform, the government launched a nationwide campaign called Ликбез (Likbez), which sought to improve literacy in the country (by the way, it worked, bringing the national literacy rate from <20% in the 1920s to >80% by the 1930s).

‼️ While this is a remarkable result that should absolutely be celebrated, it has left behind literally hundreds of thousands, if not millions, of artifacts using pre-reform Russian orthography.

😓 Researchers and historians are working tirelessly to translate these artifacts into modern Russian so that they can be archived and studied, but many have told me that they are doing this BY HAND (!).

💡 I thought, well, this is a perfect use case for OCR and a fine-tuned LLM to step in and aid in this important work!

🌍 Introducing НОВОЯЗ (NOVOYAZ)! Powered by ChatDOC/OCRFlux-3B and ZennyKenny/oss-20b-prereform-to-modern-ru-merged, researchers can now convert images of their pre-reform documents to modern Russian orthography using the power of open-source AI!

Check it out and drop a like to support more real-world use cases for open source AI outside of traditional tech-centric domains!

ZennyKenny/Novoyaz
ZennyKenny posted an update about 1 month ago
🔒 Like a lot of other AI builders, I have some anxiety about the surveillance-capitalist paradigm emerging in the AI space.

👉 Of course, this kind of thing isn't completely new and has been going on for decades, but the difference now is the deeper immersion of AI tools into our daily lives (compared to something like a search engine or social network).

โ• That's why I was really excited to come across Lumo: https://lumo.proton.me/u/1/

โ• Lumo is created by ProtonPrivacy and offers privacy-first features that make sure that what you do with you AI assistant is your business.

โ• I already trust Proton with my other business apps and I've never been disappointed, plus the Lumo architecture is really fantastic, dynamically routing each query to the most appropriate model for the request.

🔥 Really awesome stuff Proton, thank you as always.
ZennyKenny posted an update about 2 months ago
The reactions to mostlyai/synthetic-sdk-demo have been incredible! 🔥

Some users wrote that they were having performance issues on larger datasets, so I've capped the Space's input at 5,000 rows and 10 columns. You can always use the open-source SDK that powers the Space on datasets of arbitrary size and shape!
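
For reference, local usage of the SDK looks roughly like this (method names follow my reading of the project's README; treat the exact calls as assumptions and check the repo for the current API):

import pandas as pd
from mostlyai.sdk import MostlyAI

original = pd.read_csv("my_table.csv")               # your real tabular data, any size

mostly = MostlyAI(local=True)                        # run fully locally, no cloud account needed
generator = mostly.train(data=original)              # fit a generator on the original data
synthetic = mostly.generate(generator, size=10_000)  # sample a privacy-safe synthetic dataset
synthetic_df = synthetic.data()                      # back to a pandas DataFrame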

Check it out: https://github.com/mostly-ai/mostlyai 👈
Tonic posted an update about 2 months ago
codelion posted an update about 2 months ago
🚀 Adaptive Classifier v0.0.17 Released - Major Accuracy Improvements!

We've just released a major update fixing critical bugs that were causing 40-50% accuracy drops in our enterprise classifiers!

Key Fixes:
• Fixed k-parameter prediction bug causing massive accuracy loss
• Improved incremental learning for new classes
• Enhanced weight preservation during model updates

Dramatic Results:
• fraud-detection: 43.9% → 92.7% (+48.8%) adaptive-classifier/fraud-detection
• business-sentiment: 88.9% → 98.8% (+9.9%) adaptive-classifier/business-sentiment
• expense-category: 26.7% → 84.2% (+57.5%) adaptive-classifier/expense-category
• language-detection: 98.8% → 100% (+1.2%) adaptive-classifier/language-detection

15/17 enterprise classifiers now maintain ≤5% accuracy difference from original performance!

Other High-Performing Models:
• email-security (93.8%): adaptive-classifier/email-security
• content-moderation (100%): adaptive-classifier/content-moderation
• pii-detection (100%): adaptive-classifier/pii-detection

Quick Start:

from adaptive_classifier import AdaptiveClassifier

classifier = AdaptiveClassifier.load("adaptive-classifier/fraud-detection")
predictions = classifier.predict("Suspicious transaction pattern", k=3)  # k = number of top predictions to return

Install: pip install --upgrade adaptive-classifier==0.0.17

All models: adaptive-classifier

🎯 Production-ready continuous learning for enterprise text classification!

#MachineLearning #TextClassification #ContinualLearning #EnterpriseAI
Tonic posted an update about 2 months ago
COMPUTER CONTROL IS ON-DEVICE!

๐Ÿก๐Ÿค– 78 % of EU smart-home owners DONโ€™T trust cloud voice assistants.

So we killed the cloud.

Meet Exté: a palm-sized Android device that sees, hears & speaks your language - 100% offline, 0% data sent anywhere.

🔓 We submitted our technologies for consideration to the Liquid AI hackathon.

📊 Dataset: 79k UI-action pairs on Hugging Face (largest Android-control corpus ever) Tonic/android-operator-episodes

⚡ Model: 98% task accuracy, 678MB compressed, fits on existing Android devices! Tonic/l-android-control

🛤️ Experiment Tracker: check out the training on our TrackioApp Tonic/l-android-control

🎮 Live Model Demo: upload an Android screenshot and instructions to see the model in action! Tonic/l-operator-demo



Built in a garage, funded by pre-orders, no VC. Now we're scaling to 1k installer units.

We're giving 50 limited-edition prototypes to investors, installers & researchers who want to co-design the sovereign smart home.

👇 Drop "EUSKERA" in the comments if you want an invite, tag a friend who still thinks Alexa is "convenient," and smash ♥️ if AI should belong to people - not servers.
ZennyKenny posted an update about 2 months ago
The open source Synthetic Data SDK from MOSTLY AI (mostlyai) offers the ability to generate realistic, privacy-safe synthetic data with just a few lines of Python.

Try it out yourself in a No Code UI in the SDK Demo Space: mostlyai/synthetic-sdk-demo