---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
- sdxl
- stable-diffusion
license: mit
---

# Overcooked and refactored

I cooked an incorrect cross-attention-based version in which clip_l was learning from clip_g.
The next version will have correctly decoupled behaviors and the correct Cantor cross-attention formula.
The practical outcome here is latents modified by invalid projections - not even Cantor - and representations
that carry the incorrect behavior toward t5_xl. This version effectively weighted reconstruction accuracy
20/60/5% for L/G/T5, which means it failed for obvious reasons: the T5 stream is supposed to come out
different, but the L and G streams are supposed to remain useful.

Apologies for the incorrect formulas.

# VAE Lyra 🎵 - SDXL Edition

A multi-modal variational autoencoder for transforming SDXL text embeddings using geometric fusion.
It fuses CLIP-L, CLIP-G, and T5-XL into a unified latent space.

## Model Details

- **Fusion Strategy**: cantor
- **Latent Dimension**: 2048
- **Training Steps**: 15,634
- **Best Loss**: 0.3316

## Architecture

- **Modalities**:
  - CLIP-L (768d) - SDXL text_encoder
  - CLIP-G (1280d) - SDXL text_encoder_2
  - T5-XL (2048d) - Additional conditioning
- **Encoder Layers**: 3
- **Decoder Layers**: 3
- **Hidden Dimension**: 1024
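
The actual modules live in `geovocab2.train.model.vae.vae_lyra`; the sketch below only illustrates what a per-modality encoder with these dimensions could look like. Class and layer names are hypothetical, not the repo's implementation.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Illustrative per-modality encoder: 3 feed-forward layers at hidden
    dim 1024, projecting [batch, 77, input_dim] tokens to a 2048-d latent."""

    def __init__(self, input_dim: int, hidden_dim: int = 1024, latent_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
        )
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x: torch.Tensor):
        h = self.net(x)
        return self.to_mu(h), self.to_logvar(h)

# One encoder per modality, matching the dimensions listed above
encoders = nn.ModuleDict({
    "clip_l": ModalityEncoder(768),
    "clip_g": ModalityEncoder(1280),
    "t5_xl": ModalityEncoder(2048),
})
```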

## SDXL Compatibility

This model outputs both CLIP embeddings needed by SDXL:
- `clip_l`: [batch, 77, 768] → text_encoder output
- `clip_g`: [batch, 77, 1280] → text_encoder_2 output

T5-XL information is encoded into the latent space but not directly output.
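
For orientation, here is a sketch of handing the two decoded streams to a diffusers SDXL pipeline: SDXL's `prompt_embeds` is the channel-wise concatenation of the two CLIP hidden states (768 + 1280 = 2048). `recons` refers to the decoder output from the Usage section below, and the pooled embedding is taken from the pipeline's own text_encoder_2, since this VAE does not produce one.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Concatenate the two decoded CLIP streams into SDXL's 2048-d conditioning
prompt_embeds = torch.cat(
    [recons["clip_l"], recons["clip_g"]], dim=-1
).to("cuda", torch.float16)  # [batch, 77, 2048]

# SDXL also needs a pooled embedding, which this VAE does not output;
# derive it from the pipeline's own second text encoder.
ids = pipe.tokenizer_2(
    "a photo of a cat", padding="max_length", max_length=77,
    truncation=True, return_tensors="pt"
).input_ids.to("cuda")
pooled = pipe.text_encoder_2(ids).text_embeds  # [batch, 1280]

image = pipe(prompt_embeds=prompt_embeds, pooled_prompt_embeds=pooled).images[0]
```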

## Usage
```python
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hub
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra-sdxl-t5xl",
    filename="model.pt"
)

# Load checkpoint
checkpoint = torch.load(model_path, map_location="cpu")

# Create the model with the dimensions this checkpoint was trained with
config = MultiModalVAEConfig(
    modality_dims={"clip_l": 768, "clip_g": 1280, "t5_xl": 2048},
    latent_dim=2048,
    fusion_strategy="cantor"
)

model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Inputs: all three modalities, each [batch, 77, dim]
inputs = {
    "clip_l": clip_l_embeddings,  # [batch, 77, 768]
    "clip_g": clip_g_embeddings,  # [batch, 77, 1280]
    "t5_xl": t5_xl_embeddings     # [batch, 77, 2048]
}

# For SDXL inference, decode only the CLIP outputs
recons, mu, logvar = model(inputs, target_modalities=["clip_l", "clip_g"])

# Use recons["clip_l"] and recons["clip_g"] with SDXL
```
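
The snippet above assumes `clip_l_embeddings`, `clip_g_embeddings`, and `t5_xl_embeddings` already exist. One way to produce them from the standard SDXL text encoders plus a T5-XL encoder is sketched below; the checkpoint choices and the use of the penultimate CLIP hidden layer (what diffusers' SDXL `encode_prompt` uses) are assumptions, so match them to however the VAE was actually trained.

```python
import torch
from transformers import (AutoTokenizer, CLIPTextModel,
                          CLIPTextModelWithProjection, T5EncoderModel)

base = "stabilityai/stable-diffusion-xl-base-1.0"
tok_l = AutoTokenizer.from_pretrained(base, subfolder="tokenizer")
enc_l = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")
tok_g = AutoTokenizer.from_pretrained(base, subfolder="tokenizer_2")
enc_g = CLIPTextModelWithProjection.from_pretrained(base, subfolder="text_encoder_2")
tok_t5 = AutoTokenizer.from_pretrained("google/t5-v1_1-xl")   # d_model = 2048
enc_t5 = T5EncoderModel.from_pretrained("google/t5-v1_1-xl")

def ids(tokenizer, prompt):
    """Tokenize to the fixed 77-token context SDXL expects."""
    return tokenizer(prompt, padding="max_length", max_length=77,
                     truncation=True, return_tensors="pt").input_ids

prompt = "a photo of a cat"
with torch.no_grad():
    clip_l_embeddings = enc_l(ids(tok_l, prompt), output_hidden_states=True).hidden_states[-2]  # [1, 77, 768]
    clip_g_embeddings = enc_g(ids(tok_g, prompt), output_hidden_states=True).hidden_states[-2]  # [1, 77, 1280]
    t5_xl_embeddings = enc_t5(ids(tok_t5, prompt)).last_hidden_state                            # [1, 77, 2048]
```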

## Training Details

- Trained on 50,000 diverse prompts
- Mix of LAION flavors (95%) and synthetic prompts (5%)
- KL Annealing: True
- Learning Rate: 0.0001
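
"KL Annealing: True" means the weight on the KL term was ramped up from zero early in training so the latent space is not over-regularized before reconstructions become useful. A minimal linear schedule is sketched below; the warm-up length is illustrative, not the value used for this checkpoint.

```python
def kl_weight(step: int, warmup_steps: int = 5000, max_beta: float = 1.0) -> float:
    """Linearly anneal the KL coefficient from 0 to max_beta over warmup_steps."""
    return max_beta * min(1.0, step / warmup_steps)

# Per-step objective: loss = reconstruction_loss + kl_weight(step) * kl_divergence
```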

## Citation
```bibtex
@software{vae_lyra_sdxl_2025,
  author = {AbstractPhil},
  title = {VAE Lyra SDXL: Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra-sdxl-t5xl}
}
```