12 Types of JEPA

Since Yann LeCun, together with Randall Balestriero, released a new paper on JEPA (Joint-Embedding Predictive Architecture), laying out its theory and introducing an efficient practical version called LeJEPA, we figured you might need even more JEPA. Here are 7 recent JEPA variants plus 5 iconic ones:

1. LeJEPA → LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics (2511.08544)
Develops a full theory for JEPAs, defining the “ideal” JEPA embedding as an isotropic Gaussian, and proposes the SIGReg objective to push training toward this ideal, resulting in the practical LeJEPA
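
For a feel of the idea, here is a minimal PyTorch sketch: the prediction term is standard JEPA, and `sigreg_sketch` is a toy stand-in for SIGReg that pushes random 1-D projections of the embeddings toward zero mean and unit variance (the paper uses a proper statistical goodness-of-fit test per direction; all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def sigreg_sketch(z, num_proj=64):
    # Toy stand-in for SIGReg: project embeddings onto random unit
    # directions and penalize deviation of each projection's first two
    # moments from a standard Gaussian. (LeJEPA uses a univariate
    # goodness-of-fit statistic, not plain moment matching.)
    dirs = F.normalize(torch.randn(z.shape[-1], num_proj, device=z.device), dim=0)
    p = z @ dirs                                   # (batch, num_proj)
    return p.mean(0).pow(2).mean() + (p.var(0) - 1.0).pow(2).mean()

def lejepa_loss(predicted, target, embeddings, lam=0.1):
    # JEPA latent-prediction loss plus the isotropy regularizer.
    return F.mse_loss(predicted, target.detach()) + lam * sigreg_sketch(embeddings)
```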

2. JEPA-T → JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation (2510.00974)
A text-to-image model that tokenizes images and captions with a joint predictive Transformer, strengthens their fusion with cross-attention and text embeddings ahead of the training loss, and generates images by iteratively denoising visual tokens conditioned on text

3. Text-JEPA → Speaking in Words, Thinking in Logic: A Dual-Process Framework in QA Systems (2507.20491)
Converts natural language into first-order logic, with a Z3 solver handling reasoning, enabling efficient, explainable QA with far lower compute than large LLMs
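
The solver half is easy to picture with Z3's Python API. Below is a toy entailment check; the natural-language-to-logic translation (the model's job in the paper) is assumed to have already produced the facts:

```python
from z3 import Solver, Bool, Implies, Not, unsat

# Facts a translator model might emit for "Is Socrates mortal?"
human = Bool("human(socrates)")
mortal = Bool("mortal(socrates)")

s = Solver()
s.add(human)                    # premise: Socrates is human
s.add(Implies(human, mortal))   # rule: all humans are mortal

# The KB entails `mortal` iff KB plus its negation is unsatisfiable.
s.push()
s.add(Not(mortal))
print("yes" if s.check() == unsat else "unknown")   # -> yes
s.pop()
```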

4. N-JEPA (Noise-based JEPA) → Improving Joint Embedding Predictive Architecture with Diffusion Noise (2507.15216)
Connects self-supervised learning with diffusion-style noise by using noise-based masking and multi-level schedules, especially improving visual classification
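
The corruption step can be sketched as the standard diffusion forward process, with the noise level standing in for the usual patch mask (the cosine schedule below is generic, not necessarily the paper's multi-level schedule):

```python
import math
import torch

def noisy_view(x, t, T=1000):
    # Diffusion-style corruption: interpolate between the clean input
    # and Gaussian noise according to a cosine schedule indexed by t.
    # Larger t = heavier corruption = a "harder mask" for the predictor.
    alpha = math.cos((t / T) * math.pi / 2) ** 2
    return math.sqrt(alpha) * x + math.sqrt(1 - alpha) * torch.randn_like(x)
```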

5. SparseJEPA → SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures (2504.16140)
Adds sparse representation learning to make embeddings more interpretable and efficient. It groups latent variables by shared semantic structure using a sparsity penalty while preserving accuracy
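
In spirit, the change is a sparsity term added on top of the usual JEPA loss; a minimal version (the paper's semantic grouping mechanism is richer than this) could look like:

```python
import torch

def sparsity_penalty(z, lam=1e-3):
    # L1 on latent activations: drives most dimensions toward zero so
    # the surviving ones can be grouped by shared semantic structure.
    return lam * z.abs().mean()
```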

6. TS-JEPA (Time Series JEPA) → Joint Embeddings Go Temporal (2509.25449)
Adapts JEPA to time-series by learning latent self-supervised representations and predicting future latents for robustness to noise and confounders
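
Concretely, the model predicts the embedding of a future window from the embedding of the observed prefix, never the raw values; a hedged sketch, with all module names illustrative:

```python
import torch
import torch.nn.functional as F

def ts_jepa_loss(encoder, target_encoder, predictor, series, ctx_len):
    # series: (batch, time, channels). Encode the observed prefix,
    # predict the latent of the future window, compare in latent space.
    ctx, future = series[:, :ctx_len], series[:, ctx_len:]
    with torch.no_grad():                  # target branch is EMA / frozen
        target = target_encoder(future)
    return F.mse_loss(predictor(encoder(ctx)), target)
```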

Read further below ↓
If you like it, also subscribe to the Turing Post: https://www.turingpost.com/subscribe
7. TD-JEPA (Temporal-Difference JEPA) → https://huggingface.co/papers/2510.00739
An unsupervised RL method that uses TD learning to model long-term latent dynamics, training encoders and a policy-conditioned predictor for zero-shot reward optimization
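
One way to picture the TD part (a loose sketch, not the paper's exact objective): the latent predictor is trained toward a bootstrapped target that mixes the next state's embedding with the predictor's own estimate from there, like a TD(0) backup:

```python
import torch
import torch.nn.functional as F

def td_jepa_loss(encoder, tgt_encoder, predictor, obs, next_obs, policy_z, gamma=0.98):
    # TD(0)-style bootstrap in latent space: the target mixes the next
    # state's embedding with the (frozen) predictor rolled one step on.
    z = encoder(obs)
    with torch.no_grad():
        z_next = tgt_encoder(next_obs)
        target = (1 - gamma) * z_next + gamma * predictor(z_next, policy_z)
    return F.mse_loss(predictor(z, policy_z), target)
```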

5 Iconic JEPA types:

1. I-JEPA (Image-based) → https://huggingface.co/papers/2301.08243
Masks out parts of an image and predicts their latent representations from the remaining context region. Uses Vision Transformers; no pixel-level reconstruction needed
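
A minimal sketch of one training step (encoder and predictor names are illustrative; in the paper the target encoder is an EMA copy of the context encoder):

```python
import torch
import torch.nn.functional as F

def ijepa_step(ctx_encoder, tgt_encoder, predictor, patches, ctx_idx, tgt_idx):
    # patches: (batch, num_patches, dim) ViT patch tokens.
    ctx = ctx_encoder(patches[:, ctx_idx])         # encode visible context
    with torch.no_grad():
        tgt = tgt_encoder(patches)[:, tgt_idx]     # latents of masked blocks
    pred = predictor(ctx, tgt_idx)                 # predict them from context
    return F.mse_loss(pred, tgt)                   # loss in latent space only
```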

2. V-JEPA (Video-based) → https://huggingface.co/papers/2404.08471
Predicts future or missing frame embeddings from observed frames. Learns temporal dynamics without contrastive negatives or text supervision

• V-JEPA 2, trained on 1M+ hours of internet video plus a small amount of robot interaction data, can watch, understand, answer questions, and help robots plan and act in the physical world → https://huggingface.co/papers/2506.09985
3. MC-JEPA (Motion-Content) → https://huggingface.co/papers/2307.12698
Jointly learns motion (optical flow) and content features with a shared encoder. It combines a flow-prediction task with a standard image representation task (VICReg) in one model
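
The VICReg half keeps the content features from collapsing; its three terms are easy to write down (the flow-prediction loss is omitted here):

```python
import torch
import torch.nn.functional as F

def vicreg_terms(z1, z2, eps=1e-4):
    # Invariance: two views of the same image should match.
    inv = F.mse_loss(z1, z2)
    # Variance: keep each embedding dimension's std above 1.
    var = F.relu(1.0 - torch.sqrt(z1.var(dim=0) + eps)).mean()
    # Covariance: decorrelate dimensions (off-diagonal entries -> 0).
    zc = z1 - z1.mean(dim=0)
    cov = (zc.T @ zc / (z1.shape[0] - 1)).fill_diagonal_(0).pow(2).sum() / z1.shape[1]
    return inv, var, cov
```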

4. A-JEPA (Audio-based) → https://huggingface.co/papers/2311.15830
Extends JEPA to audio spectrograms. Masks time-frequency patches of the spectrogram (with a curriculum strategy) and predicts their latent features from the unmasked context
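
Masking a spectrogram means hiding a time-frequency block rather than image patches; a minimal version (the paper's curriculum, which grows mask difficulty over training, is omitted):

```python
import torch

def mask_tf_block(spec, f_span=8, t_span=16):
    # spec: (batch, freq_bins, time_steps). Returns a boolean mask that
    # is False over one randomly placed time-frequency target block.
    _, F_, T = spec.shape
    f0 = torch.randint(0, F_ - f_span + 1, (1,)).item()
    t0 = torch.randint(0, T - t_span + 1, (1,)).item()
    visible = torch.ones_like(spec, dtype=torch.bool)
    visible[:, f0:f0 + f_span, t0:t0 + t_span] = False
    return visible
```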

5. TI-JEPA (Text-Image) → https://huggingface.co/papers/2503.06380
Aligns text and image embeddings in a shared latent space via an energy-based predictive objective

We break down how JEPA works and its main ideas in this comprehensive article: https://www.turingpost.com/p/jepa

Check out more JEPA types here:

1. https://huggingface.co/posts/Kseniase/646284586461230
2. https://huggingface.co/posts/Kseniase/659118746872319