# Concept-First Code Generation

Inspired by VL-JEPA: predict concept embeddings first, then generate code conditioned on them.
## The Idea
Traditional autoregressive models predict tokens one at a time, which can cause them to lose global coherence or hallucinate APIs. The Concept-First approach addresses this with three components:

- **Concept Encoder**: encodes code snippets into semantic embeddings.
- **Concept Predictor**: predicts what the code's embedding should look like, given a natural-language query.
- **Concept-Conditioned Generation**: retrieves code with similar concept embeddings to guide the LLM.
```mermaid
graph LR
    A[Query] --> B(Concept Predictor)
    B --> C{Concept Space}
    C --> D[Retrieve Similar Code]
    D --> E[Conditioned Generation]
```
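The flow above can be sketched end to end with toy tensors. This is a minimal illustration, not the repo's actual implementation: the dimensions, the MLP architecture, and the random stand-ins for the encoders and the concept bank are all assumptions.

```python
import torch
import torch.nn.functional as F

# Toy dimensions; the real encoders (SFR-Embedding-Code, gte-Qwen2) are far larger.
TEXT_DIM, CODE_DIM = 16, 32

# Concept predictor: a hypothetical MLP mapping text-query embeddings
# into the code concept space (the architecture here is an assumption).
predictor = torch.nn.Sequential(
    torch.nn.Linear(TEXT_DIM, 64),
    torch.nn.GELU(),
    torch.nn.Linear(64, CODE_DIM),
)

# Stand-in concept bank: embeddings of 5 known code snippets, L2-normalized.
bank = F.normalize(torch.randn(5, CODE_DIM), dim=-1)

# 1) Embed the query (stand-in for the text encoder).
query_emb = torch.randn(TEXT_DIM)
# 2) Predict the target code concept.
concept = F.normalize(predictor(query_emb), dim=-1)
# 3) Retrieve the nearest snippets by cosine similarity.
scores = bank @ concept
top = scores.topk(k=2).indices
print(top.tolist())  # indices of snippets to prepend to the LLM prompt
```

The retrieved snippets would then be placed in the code LLM's context, which is what "conditioned generation" amounts to in this pipeline.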
## Models Used (January 2026)
| Component | Model | Description |
|---|---|---|
| Concept Encoder | Salesforce/SFR-Embedding-Code-2B_R | SOTA code embeddings (CoIR: 67.4) |
| Text Encoder | Alibaba-NLP/gte-Qwen2-1.5B-instruct | State-of-the-art text embeddings |
| Concept Predictor | Custom MLP | Maps text queries into the code concept space |
| Code LLM | Qwen/Qwen2.5-Coder-32B-Instruct | High-performance code generation |
## Files in this Repo
- `concept_predictor.pt`: PyTorch weights for the concept predictor MLP.
- `concept_predictor.gguf`: GGUF format for edge deployment (llama.cpp / LM Studio).
- `concept_bank.pt`: pre-computed embeddings for the concept retrieval bank.
## Usage
```python
# Load the concept predictor
import torch

checkpoint = torch.load("concept_predictor.pt", map_location="cpu")
# ... (see the Colab notebook for the full implementation)
```
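Once the predictor has produced a concept vector, retrieval against the concept bank is a nearest-neighbor lookup. The helper below is a sketch under stated assumptions: it assumes `concept_bank.pt` holds a row-wise L2-normalized `(N, D)` tensor, which the repo does not confirm, and the stand-in tensors here take the place of loading the actual files.

```python
import torch
import torch.nn.functional as F

def retrieve(concept: torch.Tensor, bank: torch.Tensor, k: int = 3) -> list[int]:
    """Return indices of the k bank entries closest to the predicted concept,
    by cosine similarity. Assumes `bank` rows are already L2-normalized."""
    concept = F.normalize(concept, dim=-1)
    scores = bank @ concept          # (N,) cosine similarities
    return scores.topk(k).indices.tolist()

# Stand-in data; in practice, load the bank from concept_bank.pt.
bank = F.normalize(torch.randn(8, 32), dim=-1)
concept = torch.randn(32)
print(retrieve(concept, bank, k=3))
```

The returned indices would map back to stored code snippets, which are then supplied as context to Qwen2.5-Coder for the final generation step.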
## Datasets
Constructed from high-quality subsets of:
- MBPP
- Evol-Instruct-Code
- Magicoder-OSS-Instruct
## Credits
Created by Core Subagent (Colab Composer) for Riley Seaburg.