---
tags:
- vae
- multimodal
- text-embeddings
- clip
- t5
- sdxl
- stable-diffusion
license: mit
---

# Overcooked and refactored

I cooked an incorrect cross-attention-based version in which clip_l was learning from clip_g.
The next version will have correctly decoupled behaviors and the correct Cantor cross-attention formula.
The practical outcome here is latents modified by invalid projections - not even Cantor - and representations
that carry the incorrect behavior toward t5_xl. This version effectively weighted reconstruction accuracy
20/60/5% for L/G/T5, which means it failed for obvious reasons: the T5 stream is supposed to come out
different, but the L and G streams are supposed to remain useful.

Apologies for the incorrect formulas.

# VAE Lyra 🎵 - SDXL Edition

A multi-modal variational autoencoder for transforming SDXL text embeddings using geometric fusion.
It fuses CLIP-L, CLIP-G, and T5-XL into a unified latent space.

## Model Details

- **Fusion Strategy**: cantor
- **Latent Dimension**: 2048
- **Training Steps**: 15,634
- **Best Loss**: 0.3316

## Architecture

- **Modalities**:
  - CLIP-L (768d) - SDXL text_encoder
  - CLIP-G (1280d) - SDXL text_encoder_2
  - T5-XL (2048d) - Additional conditioning
- **Encoder Layers**: 3
- **Decoder Layers**: 3
- **Hidden Dimension**: 1024
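
The actual modules live in `geovocab2.train.model.vae.vae_lyra`; the sketch below only illustrates what a per-modality encoder with these dimensions could look like. Class and layer names are hypothetical, not the repo's implementation.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Illustrative per-modality encoder: 3 feed-forward layers at hidden
    dim 1024, projecting [batch, 77, input_dim] tokens to a 2048-d latent."""

    def __init__(self, input_dim: int, hidden_dim: int = 1024, latent_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
        )
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x: torch.Tensor):
        h = self.net(x)
        return self.to_mu(h), self.to_logvar(h)

# One encoder per modality, matching the dimensions listed above
encoders = nn.ModuleDict({
    "clip_l": ModalityEncoder(768),
    "clip_g": ModalityEncoder(1280),
    "t5_xl": ModalityEncoder(2048),
})
```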

## SDXL Compatibility

This model outputs both CLIP embeddings needed by SDXL:
- `clip_l`: [batch, 77, 768] → text_encoder output
- `clip_g`: [batch, 77, 1280] → text_encoder_2 output

T5-XL information is encoded into the latent space but not directly output.
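
For orientation, here is a sketch of handing the two decoded streams to a diffusers SDXL pipeline: SDXL's `prompt_embeds` is the channel-wise concatenation of the two CLIP hidden states (768 + 1280 = 2048). `recons` refers to the decoder output from the Usage section below, and the pooled embedding is taken from the pipeline's own text_encoder_2, since this VAE does not produce one.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Concatenate the two decoded CLIP streams into SDXL's 2048-d conditioning
prompt_embeds = torch.cat(
    [recons["clip_l"], recons["clip_g"]], dim=-1
).to("cuda", torch.float16)  # [batch, 77, 2048]

# SDXL also needs a pooled embedding, which this VAE does not output;
# derive it from the pipeline's own second text encoder.
ids = pipe.tokenizer_2(
    "a photo of a cat", padding="max_length", max_length=77,
    truncation=True, return_tensors="pt"
).input_ids.to("cuda")
pooled = pipe.text_encoder_2(ids).text_embeds  # [batch, 1280]

image = pipe(prompt_embeds=prompt_embeds, pooled_prompt_embeds=pooled).images[0]
```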

## Usage
```python
from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hub
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra-sdxl-t5xl",
    filename="model.pt"
)

# Load checkpoint
checkpoint = torch.load(model_path, map_location="cpu")

# Create the model with the dimensions this checkpoint was trained with
config = MultiModalVAEConfig(
    modality_dims={"clip_l": 768, "clip_g": 1280, "t5_xl": 2048},
    latent_dim=2048,
    fusion_strategy="cantor"
)

model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Inputs: all three modalities, each [batch, 77, dim]
inputs = {
    "clip_l": clip_l_embeddings,  # [batch, 77, 768]
    "clip_g": clip_g_embeddings,  # [batch, 77, 1280]
    "t5_xl": t5_xl_embeddings     # [batch, 77, 2048]
}

# For SDXL inference, decode only the CLIP outputs
recons, mu, logvar = model(inputs, target_modalities=["clip_l", "clip_g"])

# Use recons["clip_l"] and recons["clip_g"] with SDXL
```
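
The snippet above assumes `clip_l_embeddings`, `clip_g_embeddings`, and `t5_xl_embeddings` already exist. One way to produce them from the standard SDXL text encoders plus a T5-XL encoder is sketched below; the checkpoint choices and the use of the penultimate CLIP hidden layer (what diffusers' SDXL `encode_prompt` uses) are assumptions, so match them to however the VAE was actually trained.

```python
import torch
from transformers import (AutoTokenizer, CLIPTextModel,
                          CLIPTextModelWithProjection, T5EncoderModel)

base = "stabilityai/stable-diffusion-xl-base-1.0"
tok_l = AutoTokenizer.from_pretrained(base, subfolder="tokenizer")
enc_l = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")
tok_g = AutoTokenizer.from_pretrained(base, subfolder="tokenizer_2")
enc_g = CLIPTextModelWithProjection.from_pretrained(base, subfolder="text_encoder_2")
tok_t5 = AutoTokenizer.from_pretrained("google/t5-v1_1-xl")   # d_model = 2048
enc_t5 = T5EncoderModel.from_pretrained("google/t5-v1_1-xl")

def ids(tokenizer, prompt):
    """Tokenize to the fixed 77-token context SDXL expects."""
    return tokenizer(prompt, padding="max_length", max_length=77,
                     truncation=True, return_tensors="pt").input_ids

prompt = "a photo of a cat"
with torch.no_grad():
    clip_l_embeddings = enc_l(ids(tok_l, prompt), output_hidden_states=True).hidden_states[-2]  # [1, 77, 768]
    clip_g_embeddings = enc_g(ids(tok_g, prompt), output_hidden_states=True).hidden_states[-2]  # [1, 77, 1280]
    t5_xl_embeddings = enc_t5(ids(tok_t5, prompt)).last_hidden_state                            # [1, 77, 2048]
```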

## Training Details

- Trained on 50,000 diverse prompts
- Mix of LAION flavors (95%) and synthetic prompts (5%)
- KL Annealing: True
- Learning Rate: 0.0001
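
"KL Annealing: True" means the weight on the KL term was ramped up from zero early in training so the latent space is not over-regularized before reconstructions become useful. A minimal linear schedule is sketched below; the warm-up length is illustrative, not the value used for this checkpoint.

```python
def kl_weight(step: int, warmup_steps: int = 5000, max_beta: float = 1.0) -> float:
    """Linearly anneal the KL coefficient from 0 to max_beta over warmup_steps."""
    return max_beta * min(1.0, step / warmup_steps)

# Per-step objective: loss = reconstruction_loss + kl_weight(step) * kl_divergence
```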

## Citation
```bibtex
@software{vae_lyra_sdxl_2025,
  author = {AbstractPhil},
  title = {VAE Lyra SDXL: Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra-sdxl-t5xl}
}
```