AI & ML interests

code LLMs, static analysis, software composition analysis, vulnerability remediation, application security

codelion posted an update about 14 hours ago
🎯 Introducing Chayan: A Calibrated 4-Model LLM Router Achieving 69% Accuracy on RouterArena

We're excited to share Chayan, a cost-efficient LLM router that intelligently routes queries across 4 models to maximize accuracy while minimizing cost. We just submitted Chayan to the RouterArena leaderboard, where it achieved 69.05% accuracy!

🔗 Model: adaptive-classifier/chayan
🔗 Dataset: RouteWorks/RouterArena

📊 Performance Highlights

Chayan achieves impressive results on the RouterArena benchmark:
• 69.05% accuracy (would rank #1 on current leaderboard)
• $0.333 per 1K queries
• +12.07pp improvement over all-mini baseline (56.98%)
• 99% of perfect 2-model oracle performance at 57% lower cost

Compared to our previous 2-model router (61.43% accuracy), Chayan delivers +7.62pp improvement through smarter 4-model routing.

🧠 How It Works

Chayan uses an Adaptive K-NN classifier with prototype memory to route between 4 models:
• openai/gpt-4o-mini (fast & cheap)
• google/gemini-2.5-flash-lite (balanced)
• google/gemini-2.5-flash (capable)
• openai/gpt-4o (most powerful)

🚀 Getting Started

You can use Chayan directly from Hugging Face:

from adaptive_classifier import AdaptiveClassifier

# Load Chayan
router = AdaptiveClassifier.load("adaptive-classifier/chayan")

# Route a query
query = "What is the capital of France?"
predictions = router.predict(query, k=4)

# Get the top model recommendation
best_model = predictions[0][0]
print(f"Recommended model: {best_model}")

Built with the adaptive-classifier library: https://github.com/codelion/adaptive-classifier
codelion posted an update 6 days ago
Want to experiment with pre-training dataset mixtures but don't want to process terabytes of data? We've got you covered.

We're releasing a collection of several carefully curated 1B token dataset samples specifically designed for rapid prototyping and pretraining experiments: https://huggingface.co/collections/codelion/pre-training-dataset-samples

These samples were created using reservoir sampling - an algorithm that guarantees statistically unbiased random samples from massive source datasets. This means results you get at the 1B token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational overhead.
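
For illustration, here is a minimal sketch of reservoir sampling (Algorithm R). The exact sampling code isn't shown in this post, so treat this as a sketch of the general technique rather than the pipeline we ran:

import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace a kept item with probability k/(i+1), uniformly at random
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir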

The collection includes:
- finePDFs-1B: High-quality textbook-style educational content
- DCLM-baseline-1B: Filtered, diverse web content
- FineWeb-Edu-1B: Curated educational web resources

We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.

Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just want to quickly prototype a pretraining run, these samples give you a solid foundation to start experimenting immediately.

Read the full story of how we used these datasets to find the optimal pretraining recipe: https://huggingface.co/blog/codelion/optimal-dataset-mixing
codelion posted an update 8 days ago
MARS Achieves Strong Results on Google DeepMind's IMO-Bench

We evaluated OptiLLM's MARS (Multi-Agent Reasoning System) approach on IMO-Bench, Google DeepMind's challenging mathematical reasoning benchmark with International Mathematical Olympiad-level problems.

What is MARS?

MARS is a multi-agent reasoning technique that works with any LLM. It uses 3 parallel reasoning agents that independently solve problems, then verifies their solutions through consensus and iterative refinement. The key advantage: it's model-agnostic and can be applied to any base model through OptiLLM's inference proxy.
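
As a rough illustration of the idea (not OptiLLM's actual implementation, which is in the repo linked below), a consensus loop over parallel agents might look like this, where solve(problem, context=None) is a user-supplied callable that queries one LLM agent:

from collections import Counter

def mars_style_consensus(solve, problem, n_agents=3, max_rounds=2):
    # Round 0: agents solve independently
    candidates = [solve(problem) for _ in range(n_agents)]
    for _ in range(max_rounds):
        answer, votes = Counter(candidates).most_common(1)[0]
        if votes > n_agents // 2:
            return answer  # majority consensus reached
        # No majority: each agent refines its answer, seeing all current candidates
        candidates = [solve(problem, context=candidates) for _ in range(n_agents)]
    return Counter(candidates).most_common(1)[0][0]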

Results on IMO-Bench:

AnswerBench (400 short-answer problems):
MARS: 36.0% (144/400 correct)
Baseline: 24.5% (98/400 correct)
Improvement: +11.5pp overall, with gains in every domain

Category breakdown:
- Algebra: 33% (vs 21% baseline)
- Combinatorics: 26% (vs 19% baseline)
- Geometry: 43% (vs 28% baseline)
- Number Theory: 42% (vs 30% baseline)

ProofBench (60 proof construction problems):
MARS: 26.7% (16/60 correct)
Baseline: 18.3% (11/60 correct)
Improvement: +8.4pp

Category breakdown:
- Number Theory: 42.9% (vs 14.3% baseline)
- Combinatorics: 37.5% (vs 31.2% baseline)
- Algebra: 18.8% (vs 25.0% baseline)
- Geometry: 7.1% (vs 0.0% baseline)

All results achieved using google/gemini-2.5-flash-lite-preview-09-2025 as the base model. The same MARS approach can enhance reasoning for any model through OptiLLM's OpenAI-compatible API.

Datasets available at:
AnswerBench: huggingface.co/datasets/Hwilner/imo-answerbench
ProofBench: huggingface.co/datasets/Hwilner/imo-proofbench

Try it yourself:

python optillm.py --approach mars --model google/gemini-2.5-flash-lite-preview-09-2025

Or via API with approach prefix:

model: "mars-google/gemini-2.5-flash-lite-preview-09-2025"
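
For example, with the proxy running locally (default address assumed), any OpenAI-compatible client works:

from openai import OpenAI

# Assumes an OptiLLM proxy running locally on its default port
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="mars-google/gemini-2.5-flash-lite-preview-09-2025",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(response.choices[0].message.content)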

Full evaluation code and results available at: github.com/algorithmicsuperintelligence/optillm
codelion posted an update 11 days ago
On this day in 2019, OpenAI released the final GPT-2 model as part of their staged release. I still remember that November well - so much was happening, but GPT-2's release felt like a watershed moment for the field. It showed us what was possible with carefully trained language models.

To recreate some of that GPT-2 magic, I recently tackled an interesting challenge: can you pretrain a language model with just 1 billion tokens - roughly 1/10th of what GPT-2 used - and still get comparable performance? After 50+ systematic experiments testing different dataset mixtures, the answer is yes.

The result is codelion/gpt-2-70m, which achieves over 90% of GPT-2's benchmark performance despite being trained on 10x less data. The key was finding the optimal dataset composition: 50% high-quality textbook PDFs, 30% filtered web content, and 20% educational resources. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).

If you're interested in the full story of how we discovered this optimal mixture and why curriculum learning catastrophically failed, check out the complete article: https://huggingface.co/blog/codelion/optimal-dataset-mixing

Sometimes less really is more - when you mix it right.
codelion posted an update 12 days ago
The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix

We trained a GPT-2 model to 90%+ performance using just 1/10th the training data through 50+ systematic experiments on dataset mixing strategies.

Key Finding:

A static mix of 50% finePDFs + 30% DCLM-baseline + 20% FineWeb-Edu consistently outperforms complex curriculum learning approaches. Static mixing is simpler, faster, and avoids catastrophic failures from hard distribution shifts.
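
In practice, a static mixture like this can be reproduced with interleaved sampling. A minimal sketch using the datasets library - the repo ids are assumptions based on the sample names, so check the collection below for the exact ids:

from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("codelion/finePDFs-1B", split="train", streaming=True)
dclm = load_dataset("codelion/DCLM-baseline-1B", split="train", streaming=True)
fineweb = load_dataset("codelion/FineWeb-Edu-1B", split="train", streaming=True)

# Static 50-30-20 mixture: each example is drawn from one source with fixed probability
mixed = interleave_datasets(
    [finepdfs, dclm, fineweb],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
)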

Results:

Our GPT-2-70M model (70M parameters, 1B tokens) scores 38.15% on benchmarks vs GPT-2's 39.13% - only 0.98 points behind despite 10x less data and 44% fewer parameters. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).

The takeaway: careful dataset curation matters more than total data volume.

Model: codelion/gpt-2-70m

Datasets: https://huggingface.co/collections/codelion/pre-training-dataset-samples

Full blog: https://huggingface.co/blog/codelion/optimal-dataset-mixing
codelion posted an update 26 days ago
🧠 Introducing Ellora Recipe #6: Execution-Aware World Model for Qwen3-4B-Thinking

Teaching LLMs to understand not just what code does, but HOW it executes at runtime!

Inspired by Meta's CWM (Code World Model) research, this LoRA adapter adds execution awareness to Qwen3-4B-Thinking-2507. The model learns to predict variable states, trace program execution step-by-step, and debug code by understanding runtime behavior.

🔍 Key Innovation:
We combine Qwen3's native thinking capabilities with real Python execution traces captured via sys.settrace(). The model is trained using GRPO with a custom reward function that scores execution prediction accuracy.
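
To make the tracing idea concrete, here is a minimal sketch of capturing line-by-line variable states with sys.settrace() (the recipe's actual tracer is in the notebook linked below):

import sys

trace_log = []

def tracer(frame, event, arg):
    # On every executed line, snapshot the local variables of that frame
    if event == "line":
        trace_log.append((frame.f_lineno, dict(frame.f_locals)))
    return tracer

def example(x):
    y = x * 2
    total = y + 1
    return total

sys.settrace(tracer)
example(3)
sys.settrace(None)

for lineno, state in trace_log:
    print(f"line {lineno}: {state}")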

📊 Training Approach:
- Hybrid Magpie-style code generation
- Real execution tracing for ground truth
- Self-supervised learning (no manual annotations!)
- 298 training samples with execution traces

✨ What it does:
- Predicts variable states at each line of code
- Explains execution flow with thinking tags
- Helps debug by understanding runtime behavior
- Works as a "neural debugger"

🎯 Results:
- 20% overall accuracy on execution prediction
- 33.3% mean state accuracy
- Trained on Qwen3-4B-Thinking (262K context, 4B params)

🔗 Links:
Model: codelion/Qwen3-4B-execution-world-model-lora
Dataset: codelion/execution-world-model-dataset
GitHub Recipe: https://github.com/codelion/ellora
Notebook: https://github.com/codelion/ellora/blob/main/Ellora_Recipe_6_Execution_World_Model_Thinking_LoRA.ipynb

Part of the Ellora project - standardized LoRA recipes for enhancing LLM capabilities. All recipes use self-supervised data generation and work with existing infrastructure (PEFT, LoRAX, vLLM).

#LLM #LoRA #CodeGeneration #WorldModel #Qwen #AI #MachineLearning
codelion posted an update about 1 month ago
🚀 Adaptive Classifier v0.1.0: Now with ONNX Runtime Support!

We're excited to announce a major update to Adaptive Classifier - a flexible, continuous learning classification system that adapts to new classes without retraining!

What's New:

⚡ ONNX Runtime Integration: Get 1.14x faster CPU inference out of the box (up to 4x on x86 processors)

📦 INT8 Quantization: Models are now 4x smaller with minimal accuracy loss, making deployment easier and faster

🎯 Smart Loading: Automatically uses the best model variant for your hardware - quantized for speed by default, or unquantized for maximum accuracy

🔄 7.5x Faster Model Loading: Get started quickly with optimized model initialization

How It Works:

Adaptive Classifier lets you build text classifiers that continuously learn from new examples without catastrophic forgetting. Perfect for:
- Dynamic classification tasks where classes evolve over time
- Few-shot learning scenarios with limited training data
- Production systems that need to adapt to new categories

The new ONNX support means you get production-ready speed on CPU without any code changes - just load and run!

Try it now:

from adaptive_classifier import AdaptiveClassifier

# Load with ONNX automatically enabled (quantized for best performance)
classifier = AdaptiveClassifier.load("adaptive-classifier/llm-router")

# Add examples dynamically
classifier.add_examples(
    ["Route this to GPT-4", "Simple task for GPT-3.5"],
    ["strong", "weak"]
)

# Predict with optimized inference
predictions = classifier.predict("Complex reasoning task")

Check out our LLM Router model to see it in action:
adaptive-classifier/llm-router

GitHub Repository:
https://github.com/codelion/adaptive-classifier

Install now: pip install adaptive-classifier

We'd love to hear your feedback and see what you build with it!

#MachineLearning #NLP #ONNX #ContinuousLearning #TextClassification
codelion posted an update about 2 months ago
🚀 Adaptive Classifier v0.0.17 Released - Major Accuracy Improvements!

We've just released a major update fixing critical bugs that were causing 40-50% accuracy drops in our enterprise classifiers!

Key Fixes:
• Fixed k-parameter prediction bug causing massive accuracy loss
• Improved incremental learning for new classes
• Enhanced weight preservation during model updates

Dramatic Results:
• fraud-detection: 43.9% → 92.7% (+48.8pp) adaptive-classifier/fraud-detection
• business-sentiment: 88.9% → 98.8% (+9.9pp) adaptive-classifier/business-sentiment
• expense-category: 26.7% → 84.2% (+57.5pp) adaptive-classifier/expense-category
• language-detection: 98.8% → 100% (+1.2pp) adaptive-classifier/language-detection

15/17 enterprise classifiers now maintain ≤5% accuracy difference from original performance!

Other High-Performing Models:
• email-security (93.8%): adaptive-classifier/email-security
• content-moderation (100%): adaptive-classifier/content-moderation
• pii-detection (100%): adaptive-classifier/pii-detection

Quick Start:
from adaptive_classifier import AdaptiveClassifier

classifier = AdaptiveClassifier.load("adaptive-classifier/fraud-detection")
predictions = classifier.predict("Suspicious transaction pattern", k=3)

Install: pip install --upgrade adaptive-classifier==0.0.17

All models: adaptive-classifier

🎯 Production-ready continuous learning for enterprise text classification!

#MachineLearning #TextClassification #ContinualLearning #EnterpriseAI
codelion posted an update 3 months ago
Over 40 percent of AI-generated code contains security vulnerabilities. We recently worked on a LoRA that writes secure code by default, using automated Semgrep analysis and GRPO, achieving a 97 percent reduction in vulnerabilities without requiring security-specific prompts.

Technical Approach:
An automated security training pipeline combines Semgrep vulnerability detection with preference learning: generate multiple solutions with varying security awareness, automatically analyze them for vulnerabilities, create preference pairs based on security scores, and train using GRPO with multi-factor scoring.

Scoring System (100 points total):
- Functionality: 40 points - Does the code work correctly
- Security patterns: 40 points - Uses secure coding practices
- Low vulnerabilities: 20 points - Semgrep score below threshold

This balanced scoring prevents reward hacking where models generate empty functions to avoid vulnerabilities.
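
A sketch of what such a composite reward might look like - the weights match the post, but the function and field names are illustrative, not the recipe's actual code:

def security_reward(passes_tests, secure_pattern_score, semgrep_findings, threshold=2):
    # 100-point score: 40 functionality, 40 secure patterns, 20 low vulnerabilities
    score = 0.0
    if passes_tests:  # functionality gate keeps empty functions from scoring well
        score += 40.0
    score += 40.0 * max(0.0, min(1.0, secure_pattern_score))
    if semgrep_findings <= threshold:  # Semgrep score below threshold
        score += 20.0
    return score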

Real Transformation Examples:

Database query before:
query = f"SELECT * FROM products WHERE name = '{name}'"

Database query after:
query = "SELECT * FROM products WHERE name = ?"
db.execute(query, (name,))

Password hashing before:
password_hash = hashlib.md5(password).hexdigest()

Password hashing after:
salt = bcrypt.gensalt(rounds=12)
password_hash = bcrypt.hashpw(password.encode('utf-8'), salt)

Model: codelion/Qwen2.5-Coder-0.5B-Instruct-security-grpo-lora
Notebook: https://github.com/codelion/ellora/blob/main/Ellora_Recipe_5_Secure_Code_Generation_LoRA.ipynb
Repository: https://github.com/codelion/ellora
codelion posted an update 3 months ago
I recently worked on a LoRA that improves tool use in LLMs. Thought the approach might interest folks here.

The issue I've had when using local LLMs with coding agents is this:

Me: "Find all API endpoints with authentication in this codebase"
LLM: "You should look for @app .route decorators and check if they have auth middleware..."

But I want it to actually search the files and show me; instead, the LLM doesn't trigger a tool call.

To fine-tune it for tool use I combined two data sources:

1. Magpie scenarios - 5000+ diverse tasks (bug hunting, refactoring, security audits)
2. Real execution - Ran these on actual repos (FastAPI, Django, React) to get authentic tool responses

This ensures the model learns both breadth (many scenarios) and depth (real tool behavior).

Tools We Taught:
- read_file - Actually read file contents
- search_files - Regex/pattern search across codebases
- find_definition - Locate classes/functions
- analyze_imports - Dependency tracking
- list_directory - Explore structure
- run_tests - Execute test suites
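
For context, tools like these are typically exposed to the model as JSON schemas. A sketch of how search_files might be declared in the OpenAI function-calling format (the parameter names here are illustrative):

search_files_tool = {
    "type": "function",
    "function": {
        "name": "search_files",
        "description": "Regex/pattern search across the codebase",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {"type": "string", "description": "Regex to search for"},
                "path": {"type": "string", "description": "Directory to search within"},
            },
            "required": ["pattern"],
        },
    },
}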

Improvements:
- Tool calling accuracy: 12% → 80%
- Correct parameters: 8% → 87%
- Multi-step tasks: 3% → 78%
- End-to-end completion: 5% → 80%
- Tools per task: 0.2 → 3.8

The LoRA really improves intentional tool calling. As an example, consider the query: "Find ValueError in payment module"

The response proceeds as follows:

1. Calls search_files with pattern "ValueError"
2. Gets 4 matches across 3 files
3. Calls read_file on each match
4. Analyzes context
5. Reports: "Found 3 ValueError instances: payment/processor.py:47 for invalid amount, payment/validator.py:23 for unsupported currency..."

Resources:
- Colab notebook https://colab.research.google.com/github/codelion/ellora/blob/main/Ellora_Recipe_3_Enhanced_Tool_Calling_and_Code_Understanding.ipynb
- Model - codelion/Llama-3.2-1B-Instruct-tool-calling-lora
- GitHub - https://github.com/codelion/ellora
codelion posted an update 3 months ago
I wanted to share a technique that's been working really well for recovering performance after INT4 quantization.

Typically, quantizing an LLM to INT4 (unlike, say, INT8) incurs some accuracy loss at inference. Instead of accepting the quality loss, we used the FP16 model as a teacher to train a tiny LoRA adapter (rank=16) for the quantized model. The cool part: the model generates its own training data using the Magpie technique, so no external datasets are needed. This is critical because we want to stay as close as possible to the distribution of the model's natural responses.
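
A minimal sketch of the distillation step, assuming a 4-bit student via bitsandbytes and a rank-16 LoRA via peft, with self-generated prompts standing in for the Magpie data (an illustration, not the ellora recipe itself):

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

name = "Qwen/Qwen3-0.6B"
tok = AutoTokenizer.from_pretrained(name)
teacher = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")
student = AutoModelForCausalLM.from_pretrained(
    name, quantization_config=BitsAndBytesConfig(load_in_4bit=True), device_map="auto"
)
# Rank-16 adapter on the quantized student; target modules are a common default
student = get_peft_model(student, LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]))
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distill_step(prompt, temperature=2.0):
    batch = tok(prompt, return_tensors="pt").to(student.device)
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    # Match the student's token distribution to the FP16 teacher's
    loss = F.kl_div(
        F.log_softmax(student_logits.float() / temperature, dim=-1),
        F.softmax(teacher_logits.float() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()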

Last year Apple's foundational models paper (https://arxiv.org/pdf/2407.21075) had proposed a similar technique and found "By using accuracy-recovery LoRA adapters with only rank 16, Alpaca win rate can be improved by 7-18%, GMS8K accuracy is boosted by 5-10%." (page 47).

We saw similar results on Qwen3-0.6B:

Perplexity: 2.40 → 2.09 (only 5.7% degradation from FP16 baseline)
Memory: Only 0.28GB vs 1.0GB for FP16 (75% reduction)
Speed: 3.0x faster inference than FP16
Quality: Generates correct, optimized code solutions

- Pre-trained adapter: codelion/Qwen3-0.6B-accuracy-recovery-lora
- GitHub repo: https://github.com/codelion/ellora

Happy to answer questions about the implementation or help anyone trying to replicate this. The key insight is that quantization errors are systematic and learnable - a small adapter can bridge the gap without negating the benefits of quantization.

Has anyone else experimented with self-distillation for quantization recovery? Would love to hear about different approaches!
codelion posted an update 3 months ago
I recently added a recipe in ellora that improves the reasoning capabilities of Gemma-3-1B using self-supervised learning. The model now shows step-by-step thinking in <think> tags before answering.

Logic puzzle accuracy: 61% → 84%. 3 hours of training on a single GPU. 🧠

We used GRPO, where the model generates multiple responses and learns to prefer better reasoning. It works surprisingly well for making smaller models more transparent.

🔗 Colab: https://colab.research.google.com/github/codelion/ellora/blob/main/Ellora_Recipe_2_Reasoning_LoRA_with_Self-Rewarding_GRPO.ipynb

🤗 Model: codelion/gemma-3-1b-it-reasoning-grpo-lora

💻 Code: https://github.com/codelion/ellora
codelion posted an update 3 months ago
Released 17 production-ready adaptive text classifiers that learn from just 100 examples per class and continuously improve without retraining.

These models achieve 93% average accuracy across enterprise use cases like email routing, fraud detection, document classification, and support ticket categorization. Built on ModernBERT with prototype memory and elastic weight consolidation.

Key benefits: 90% cost reduction vs API solutions, 90-120ms local inference, dynamic class addition, and zero vendor lock-in.
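
Continuous improvement means you can keep calling add_examples as labeled data arrives, even for classes the model has never seen. A short sketch using the library's API (the new class and example texts here are invented for illustration):

from adaptive_classifier import AdaptiveClassifier

classifier = AdaptiveClassifier.load("adaptive-classifier/fraud-detection")

# Add a brand-new class on the fly - no retraining from scratch
classifier.add_examples(
    ["Funds cycled through three newly opened shell accounts"],
    ["money-laundering"],
)

predictions = classifier.predict("Rapid transfers between fresh accounts", k=2)
print(predictions)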

All models available under adaptive-classifier organization. Install with pip install adaptive-classifier.

Full technical details: https://huggingface.co/blog/codelion/enterprise-ready-classifiers
Code: https://github.com/codelion/adaptive-classifier
codelion posted an update 3 months ago
🧬 We just published our comprehensive analysis of OpenEvolve - an open-source evolutionary coding agent that automatically optimizes algorithms using LLMs!

Our key findings from 29 experiments across 10 models:

- Gemini Flash 2.5 achieved 2.04x speedup across 30 benchmark tasks
- Open models like Gemma 3 27B (1.63x) and Qwen3-Coder 480B (1.41x) rivaled proprietary models
- The system discovered entirely new algorithms - not just code optimizations!
- One task evolved from DFS to BFS to Union-Find approaches
- Specialized coding models outperformed much larger general models
- 200 iterations beat 100 iterations by 24%
- Ensembles surprisingly failed due to conflicting optimization strategies

Most fascinating: watching models evolve code step-by-step, like transforming matrix operations from basic eigendecomposition to vectorized one-liners with 32x speedup.

Our systematic experimental approach reveals that open-source evolutionary coding is becoming seriously competitive with proprietary solutions. We tested everything from temperature settings to evolution strategies to find optimal configurations.

This research shows automated code optimization is ready for real-world applications. The methodology we developed can guide anyone building evolutionary coding systems.

Full paper with code examples, detailed methodology, and all experimental results: https://huggingface.co/blog/driaforall/towards-open-evolutionary-agents

What optimization challenges could benefit from evolutionary approaches in your work?
codelion posted an update 3 months ago
Extended the ICM paper to show cross-model capability transfer - used Qwen3's mathematical reasoning to improve Gemma3 without any human supervision.

Key results:

Qwen3-0.6B: 63.2 → 66.0 on MATH-500 (+4%)
Gemma3-1B: 41.0 → 45.6 on MATH-500 (+11%)

The method extracts coherent reasoning patterns from one model via Internal Coherence Maximization, converts them to DPO training data, and uses that to improve a completely different model architecture.
This goes beyond the original ICM paper, which only improved models using their own labels. We're showing you can transfer capabilities between any two models - imagine extracting capabilities from strong models to improve your local ones.
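
A rough sketch of the conversion step: ICM-extracted teacher solutions become "chosen" responses, with the student's weaker attempts as "rejected", in the standard DPO pair format (field names follow the common convention; this is an illustration, not the release code):

def build_dpo_pairs(prompts, teacher_answers, student_answers):
    # Pair teacher (chosen) vs student (rejected) answers for DPO training
    pairs = []
    for prompt, chosen, rejected in zip(prompts, teacher_answers, student_answers):
        if chosen != rejected:  # skip prompts where both models already agree
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs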

Models available:

codelion/Qwen3-0.6B-ICM-DPO
codelion/gemma-3-1b-it-ICM-DPO

Complete collection with code and datasets:
codelion/internal-coherence-maximization-687a1bd1c1f5f1d6f76e9b3b

Full methodology and results:
https://huggingface.co/blog/codelion/internal-coherence-maximization

Planning to extend this to code generation next. The approach could enable community-driven capability sharing between different model families without expensive annotation.
codelion posted an update 4 months ago
Implemented Test-Time Diffusion Deep Researcher (TTD-DR) in OptiLLM! 🚀

Just shipped a game-changing feature that turns any LLM into a powerful research agent. TTD-DR applies diffusion-inspired techniques to iteratively refine research reports while grounding them in real web sources.

How it works:
• Generates initial draft
• Identifies knowledge gaps
• Searches web for missing info
• Iteratively refines through "denoising" steps
• Produces comprehensive reports with 15-30+ sources
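
A toy sketch of that loop, where llm and search are user-supplied callables (this is the shape of the algorithm, not the plugin's actual code):

def ttd_dr_sketch(llm, search, query, steps=3):
    draft = llm(f"Write an initial research report on: {query}")
    sources = []
    for _ in range(steps):
        # Identify what the current draft is missing
        gaps = llm(f"List the biggest knowledge gaps in this report:\n{draft}")
        results = search(gaps)  # retrieve web evidence for those gaps
        sources.extend(results)
        # "Denoise": revise the draft grounded in the retrieved sources
        draft = llm(f"Revise the report to fix the gaps using these sources:\n{results}\n---\n{draft}")
    return draft, sources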

The magic? It works with ANY model so you can choose your favorite open-source models on HF!

Key results:
- 47 complex research queries tested
- Every report backed by real web sources
- Quality rivals human research analysts
- No more hallucinations on current events!

Try it:
pip install optillm
Then use "deep_research-your-model-name" as the model identifier

- Implementation: https://github.com/codelion/optillm/tree/main/optillm/plugins/deep_research
- Paper: https://arxiv.org/abs/2507.16075v1
- Sample reports: https://github.com/codelion/optillm/tree/main/optillm/plugins/deep_research/sample_reports

Special thanks to the TTD-DR paper authors for this brilliant approach!

#research #llm #opensource #inference
codelion posted an update 4 months ago
New research: Understanding how different LLMs approach reasoning through "thought anchors"

I just published a comparative study analyzing the reasoning patterns of Qwen3-0.6B vs DeepSeek-R1-Distill-1.5B using thought anchors - critical sentences that significantly impact task success probability.
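
Conceptually, a step's impact is the change in success probability when that sentence is removed from the reasoning trace. A toy sketch, where success_prob is a user-supplied estimator (e.g., by resampling rollouts; the full methodology is in the PTS library):

def anchor_impact(success_prob, steps, i):
    # Impact of step i: success probability with the step minus without it
    return success_prob(steps) - success_prob(steps[:i] + steps[i + 1:])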

Key findings:
- DeepSeek-R1: Uses concentrated reasoning with fewer, high-impact steps (0.408 avg impact)
- Qwen3: Employs distributed reasoning spreading impact across multiple steps (0.278 avg impact)
- Different risk-reward profiles: DeepSeek more consistent (82.7% positive steps), Qwen3 more exploratory (71.6% positive)

This reveals different cognitive architectures rather than simple performance differences. The models optimize for different reasoning strategies - consistency vs exploration.

Both datasets are now available on HF:
- Qwen3 thought anchors: codelion/Qwen3-0.6B-pts-thought-anchors
- DeepSeek-R1 thought anchors: codelion/DeepSeek-R1-Distill-Qwen-1.5B-pts-thought-anchors

Built using our open-source PTS library for mechanistic interpretability analysis. All methodology is fully reproducible.

Full article: https://huggingface.co/blog/codelion/understanding-model-reasoning-thought-anchors

What reasoning patterns have you noticed in your model experiments? Would love to hear about other architectures showing similar cognitive diversity!
codelion posted an update 4 months ago
New SOTA for 26-circle packing problem! ypwang61 achieved 2.635977 sum of radii using OpenEvolve evolutionary optimization framework.

Progress: AlphaEvolve originally reported 2.635 in their paper, OpenEvolve made improvements, and now we have this new record at 2.635977.

The solution uses multi-stage optimization with specialized pattern initialization and enhanced penalty functions. Circle packing is a notoriously hard optimization problem where these small improvements actually represent significant algorithmic advances.
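
For reference, the objective is easy to state: place 26 non-overlapping circles inside a unit square and maximize the sum of radii. A small checker for candidate solutions (assuming the standard unit-square formulation):

from math import hypot

def check_packing(circles, eps=1e-9):
    # circles: list of (x, y, r); verify containment and non-overlap, return sum of radii
    for x, y, r in circles:
        if r <= 0 or x - r < -eps or x + r > 1 + eps or y - r < -eps or y + r > 1 + eps:
            raise ValueError("circle leaves the unit square")
    for i, (x1, y1, r1) in enumerate(circles):
        for x2, y2, r2 in circles[i + 1:]:
            if hypot(x1 - x2, y1 - y2) < r1 + r2 - eps:
                raise ValueError("circles overlap")
    return sum(r for _, _, r in circles)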

Great example of how evolutionary algorithms can push boundaries in computational geometry optimization. The implementation and results are shared openly on GitHub.

Link: https://github.com/codelion/openevolve/issues/156
codelion posted an update 5 months ago
🚀 Just published: "OpenEvolve: Open-Source Evolutionary Code Optimization with Real-World GPU Kernel Discovery"

We built the first open-source implementation of Google's AlphaEvolve system and used it to automatically discover GPU kernel optimizations that outperform human engineers!

Key results:

- 21.8% average decode speed improvement on Apple Silicon
- 36.7% improvement on long-context transformer attention
- Discovered novel vectorization patterns and 2-pass softmax algorithm

The system evolved a Metal kernel for Qwen3's Grouped Query Attention from a basic 3-pass implementation into something with sophisticated Apple Silicon optimizations that would take experts months to discover manually. The evolved kernel automatically found the optimal vec<T,8> operations for 128-dim attention heads and fused softmax computation with value accumulation.
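
The 2-pass softmax trick is worth unpacking: a naive stable softmax needs three passes (max, exp-sum, normalize), but keeping a running, rescaled sum fuses the first two. A plain-Python illustration of the general technique (the evolved Metal kernel itself is in the write-up):

from math import exp

def softmax_two_pass(xs):
    m, s = float("-inf"), 0.0
    for x in xs:  # pass 1: fused running max + rescaled exponential sum
        if x > m:
            s = s * exp(m - x) + 1.0
            m = x
        else:
            s += exp(x - m)
    return [exp(x - m) / s for x in xs]  # pass 2: normalize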

Really excited about the potential here - imagine evolutionary algorithms automatically discovering optimizations across all our AI infrastructure. What would you want to optimize with this approach?

Full write-up: https://huggingface.co/blog/codelion/openevolve-gpu-kernel-discovery

GitHub: https://github.com/codelion/openevolve

#AI #MachineLearning #GPU #OpenSource #Evolution #CodeOptimization #TransformerOptimization
codelion posted an update 5 months ago
Adaptive Classifier: Dynamic Text Classification with Strategic Learning

New text classification system that learns continuously without catastrophic forgetting. Achieved 22.2% robustness improvement on adversarial datasets while maintaining clean data performance.

🎯 THE PROBLEM
Traditional classifiers require complete retraining when adding new classes. Expensive and time-consuming, especially with adversarial users trying to game the system.

🚀 KEY INNOVATIONS
• Hybrid memory-neural architecture (prototype-based + neural adaptation)
• Strategic classification using game theory to predict and defend against manipulation
• Elastic Weight Consolidation prevents catastrophic forgetting

📊 RESULTS
Tested on AI-Secure/adv_glue dataset:
• Clean data: 80.0% → 82.2% (+2.2pp)
• Manipulated data: 60.0% → 82.2% (+22.2pp)
• Zero performance drop under adversarial attacks

🔬 APPLICATIONS
• Hallucination detection: 80.7% recall for RAG safety
• LLM routing: 26.6% cost optimization improvement
• Content moderation: Robust against gaming attempts

⚙️ USAGE
pip install adaptive-classifier

from adaptive_classifier import AdaptiveClassifier

classifier = AdaptiveClassifier("bert-base-uncased")
texts = ["Great product, works as advertised", "Total scam, avoid"]  # example data
labels = ["positive", "negative"]
classifier.add_examples(texts, labels)
predictions = classifier.predict("New text")

🔗 RESOURCES
Blog: https://huggingface.co/blog/codelion/adaptive-classifier
Code: https://github.com/codelion/adaptive-classifier
Models: adaptive-classifier

Available models: llm-hallucination-detector, llm-config-optimizer, llm-router

Works with any HuggingFace transformer. Fully open source and production-ready!