MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues
Abstract
MagicQuill V2 introduces a layered composition paradigm for generative image editing that combines the semantic power of diffusion models with granular control, allowing user intentions for content, position, shape, and color to be specified and manipulated separately.
We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their reliance on a single, monolithic prompt fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.
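To make the layer separation concrete, below is a minimal, hypothetical Python sketch of how the four cue types described in the abstract might be bundled before being handed to a unified control module. The class name `VisualCueStack`, its field names, and the `as_condition` helper are illustrative assumptions for exposition, not the released MagicQuill V2 API.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np


@dataclass
class VisualCueStack:
    """Hypothetical container for the layered visual cues: content, space, structure, color."""
    content: str                                    # content layer: what to create (text prompt or reference)
    spatial_mask: Optional[np.ndarray] = None       # spatial layer: where to place it (binary H x W mask)
    structure_sketch: Optional[np.ndarray] = None   # structural layer: how it is shaped (edge/sketch map)
    color_palette: Optional[np.ndarray] = None      # color layer: coarse color strokes or palette map

    def as_condition(self) -> dict:
        """Collect the non-empty cues into a conditioning dict for a hypothetical editing backend."""
        cues = {"content": self.content}
        if self.spatial_mask is not None:
            cues["where"] = self.spatial_mask
        if self.structure_sketch is not None:
            cues["shape"] = self.structure_sketch
        if self.color_palette is not None:
            cues["color"] = self.color_palette
        return cues


# Example: request a red mug inside a user-brushed region of a 512x512 canvas.
h = w = 512
mask = np.zeros((h, w), dtype=np.uint8)
mask[300:450, 100:260] = 1  # region where the new object should appear

stack = VisualCueStack(
    content="a ceramic coffee mug",
    spatial_mask=mask,
    color_palette=np.full((h, w, 3), (200, 40, 40), dtype=np.uint8),  # rough red color hint
)
print(sorted(stack.as_condition().keys()))  # -> ['color', 'content', 'where']
```

The point of the sketch is only that each intention lives in its own layer, so a user can edit one cue (e.g., repaint the color hint) without retyping or re-entangling the others.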
Community
• Project Page: https://magicquill.art/v2
• Code: https://github.com/zliucz/MagicQuillV2
• HuggingFace Space: https://huggingface.co/spaces/AI4Editing/MagicQuillV2
The following papers were recommended by the Semantic Scholar API
- LayerComposer: Multi-Human Personalized Generation via Layered Canvas (2025)
- Canvas-to-Image: Compositional Image Generation with Multimodal Controls (2025)
- SkyReels-Text: Fine-grained Font-Controllable Text Editing for Poster Design (2025)
- The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment (2025)
- Text2Traffic: A Text-to-Image Generation and Editing Method for Traffic Scenes (2025)
- LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization (2025)
- Layer-Aware Video Composition via Split-then-Merge (2025)
