Concept-First Code Generation

Inspired by VL-JEPA: Predict concept embeddings first, then generate code conditioned on them.

The Idea

Traditional autoregressive models predict code token by token, which can cause loss of global coherence or hallucinated APIs. The Concept-First approach addresses this in three stages:

  1. Concept Encoder: Encoding code snippets into semantic embeddings.
  2. Concept Predictor: Predicting what the code embedding should look like given a query.
  3. Concept-Conditioned Generation: Retrieving code with similar concept embeddings and using it to condition the LLM's output.
```mermaid
graph LR
    A[Query] --> B(Concept Predictor)
    B --> C{Concept Space}
    C --> D[Retrieve Similar Code]
    D --> E[Conditioned Generation]
```
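The predict-then-retrieve flow above can be sketched with a toy concept bank. Everything here is illustrative: the bank is random, the "predicted" concept is a stand-in for the real MLP's output, and the dimensions are made up.

```python
import numpy as np

def cosine_top_k(query_vec, bank, k=2):
    """Return indices of the k bank rows most similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q
    return np.argsort(-sims)[:k]

# Toy concept bank: 4 code snippets embedded in an 8-dim concept space.
rng = np.random.default_rng(0)
concept_bank = rng.normal(size=(4, 8))
snippets = ["def add(a, b): ...", "def sort(xs): ...",
            "class Stack: ...", "def read_csv(path): ..."]

# Stand-in for the concept predictor: perturb a known bank entry
# slightly so retrieval has an unambiguous right answer.
predicted_concept = concept_bank[1] + 0.01 * rng.normal(size=8)

for i in cosine_top_k(predicted_concept, concept_bank, k=2):
    print(snippets[i])
```

The real pipeline replaces the random bank with `concept_bank.pt` and the perturbation with the trained predictor's output.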

Models Used (January 2026)

| Component | Model | Description |
|---|---|---|
| Concept Encoder | Salesforce/SFR-Embedding-Code-2B_R | SOTA code embeddings (CoIR: 67.4) |
| Text Encoder | Alibaba-NLP/gte-Qwen2-1.5B-instruct | State-of-the-art text embedding |
| Concept Predictor | Custom MLP | Maps text queries to code concept space |
| Code LLM | Qwen/Qwen2.5-Coder-32B-Instruct | High-performance code generation |
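The custom MLP in the table can be sketched in PyTorch as below. The input/output/hidden dimensions are guesses for illustration, not values read from the shipped checkpoint; inspect `concept_predictor.pt` for the real shapes.

```python
import torch
import torch.nn as nn

class ConceptPredictor(nn.Module):
    """MLP mapping the text-embedding space to the code-concept space.

    Dimensions here are illustrative assumptions, not taken from
    the actual concept_predictor.pt checkpoint.
    """
    def __init__(self, text_dim=1536, code_dim=2304, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, code_dim),
        )

    def forward(self, text_emb):
        # L2-normalise so outputs live on the same unit sphere as the
        # code embeddings used for cosine-similarity retrieval.
        out = self.net(text_emb)
        return out / out.norm(dim=-1, keepdim=True)

predictor = ConceptPredictor()
query_emb = torch.randn(1, 1536)
concept = predictor(query_emb)
print(concept.shape)
```

Normalising the output keeps prediction and retrieval in the same metric space, which matters when the bank is searched by cosine similarity.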

Files in this Repo

  • concept_predictor.pt: PyTorch weights for the concept predictor MLP.
  • concept_predictor.gguf: GGUF format for edge deployment (llama.cpp/LM Studio).
  • concept_bank.pt: Pre-computed embeddings for the concept retrieval bank.

Usage

```python
# Load the concept predictor weights
import torch

checkpoint = torch.load("concept_predictor.pt", map_location="cpu")
# ... (See Colab notebook for full implementation)
```
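Once similar snippets are retrieved, conditioning the code LLM can be as simple as prepending them to the prompt. The function and prompt wording below are a hedged sketch; the actual prompting scheme lives in the Colab notebook.

```python
def build_conditioned_prompt(query, retrieved_snippets):
    """Prepend retrieved reference code so the LLM generates in the
    neighbourhood of the predicted concept (illustrative format)."""
    context = "\n\n".join(
        f"# Reference snippet {i + 1}:\n{s}"
        for i, s in enumerate(retrieved_snippets)
    )
    return (
        "You are a coding assistant. Use the reference snippets as "
        "semantic guidance, not as text to copy.\n\n"
        f"{context}\n\n# Task:\n{query}\n"
    )

prompt = build_conditioned_prompt(
    "Write a function that deduplicates a list while preserving order.",
    ["def unique(xs):\n    seen = set()\n    ..."],
)
print(prompt)
```

The resulting string would be passed as the user message to Qwen2.5-Coder via whatever chat interface you deploy it behind.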

Datasets

Constructed from high-quality subsets of:

  • MBPP
  • Evol-Instruct-Code
  • Magicoder-OSS-Instruct

Credits

Created by Core Subagent (Colab Composer) for Riley Seaburg.

Model Details

  • Format: GGUF
  • Model size: 14.7M params
  • Architecture: concept_predictor