---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
- sdxl
- stable-diffusion
license: mit
---
# Overcooked and refactored
The previous version used an incorrect cross-attention mechanism in which clip_l learned from clip_g.
The next version will correctly decouple the two CLIP streams and use the correct Cantor cross-attention formula.
The practical outcome of the mistake was latents modified by invalid projections (not even Cantor) and representations that carried
the incorrect behavior into the t5_xl stream. That version effectively weighted reconstruction accuracy roughly 20/60/5% across L/G/T5.
It failed for obvious reasons: the T5 output is supposed to come out different, but the L and G outputs are supposed to stay useful.
Apologies for the incorrect formulas.
# VAE Lyra 🎵 - SDXL Edition
Multi-modal Variational Autoencoder that transforms SDXL text embeddings using geometric fusion.
It fuses CLIP-L, CLIP-G, and T5-XL into a unified latent space.
## Model Details
- **Fusion Strategy**: cantor
- **Latent Dimension**: 2048
- **Training Steps**: 15,634
- **Best Loss**: 0.3316
## Architecture
- **Modalities**:
  - CLIP-L (768d) - SDXL text_encoder
  - CLIP-G (1280d) - SDXL text_encoder_2
  - T5-XL (2048d) - Additional conditioning
- **Encoder Layers**: 3
- **Decoder Layers**: 3
- **Hidden Dimension**: 1024
## SDXL Compatibility
This model outputs both CLIP embeddings needed for SDXL:
- `clip_l`: [batch, 77, 768] → text_encoder output
- `clip_g`: [batch, 77, 1280] → text_encoder_2 output
T5-XL information is encoded into the latent space but is not directly output.
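For reference, SDXL's UNet consumes these two encoder outputs concatenated along the feature dimension. A minimal shape-only sketch with dummy tensors (the variable names are illustrative, not part of this model's API):

```python
import torch

# Dummy stand-ins for the model's decoded outputs (illustrative only)
clip_l = torch.randn(1, 77, 768)   # text_encoder-style output
clip_g = torch.randn(1, 77, 1280)  # text_encoder_2-style output

# SDXL's cross-attention conditioning concatenates the two streams
# along the last dimension: [batch, 77, 768 + 1280] = [batch, 77, 2048]
prompt_embeds = torch.cat([clip_l, clip_g], dim=-1)
print(prompt_embeds.shape)  # torch.Size([1, 77, 2048])
```

Note that SDXL additionally conditions on the pooled text_encoder_2 embedding (1280-d), which this VAE does not produce; that would still need to come from the original encoder.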
## Usage
```python
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra-sdxl-t5xl",
    filename="model.pt"
)

# Load checkpoint
checkpoint = torch.load(model_path, map_location="cpu")

# Create model
config = MultiModalVAEConfig(
    modality_dims={"clip_l": 768, "clip_g": 1280, "t5_xl": 2048},
    latent_dim=2048,
    fusion_strategy="cantor"
)
model = MultiModalVAE(config)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Encode all three modalities
inputs = {
    "clip_l": clip_l_embeddings,  # [batch, 77, 768]
    "clip_g": clip_g_embeddings,  # [batch, 77, 1280]
    "t5_xl": t5_xl_embeddings     # [batch, 77, 2048]
}

# For SDXL inference, decode only the CLIP outputs
recons, mu, logvar = model(inputs, target_modalities=["clip_l", "clip_g"])

# Use recons["clip_l"] and recons["clip_g"] with SDXL
```
## Training Details
- Trained on 50,000 diverse prompts
- Mix of LAION flavors (95%) and synthetic prompts (5%)
- KL Annealing: True
- Learning Rate: 0.0001
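The KL annealing noted above can be sketched as a warm-up of the KL term's weight. The linear schedule and parameter values below are assumptions for illustration, not the exact schedule used in training:

```python
def kl_weight(step: int, warmup_steps: int = 5000, beta_max: float = 1.0) -> float:
    """Linearly anneal the KL weight from 0 to beta_max over warmup_steps.

    Hypothetical schedule: the actual annealing curve used for this
    model is not documented here.
    """
    return beta_max * min(1.0, step / warmup_steps)

# Per-step VAE loss (recon_loss and kl_div computed elsewhere):
# loss = recon_loss + kl_weight(step) * kl_div
```

Annealing keeps the KL penalty small early on so the decoder learns to reconstruct before the latent distribution is pulled toward the prior.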
## Citation
```bibtex
@software{vae_lyra_sdxl_2025,
  author = {AbstractPhil},
  title = {VAE Lyra SDXL: Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra-sdxl-t5xl}
}
```