- coqui/XTTS-v2
  Text-to-Speech • Updated • 4.95M • 3.18k
- deepseek-ai/DeepSeek-V3-0324
  Text Generation • 685B • Updated • 208k • 3.08k
- openai/whisper-large-v3
  Automatic Speech Recognition • 2B • Updated • 4.2M • 5.11k
- Distilling an End-to-End Voice Assistant Without Instruction Training Data
  Paper • 2410.02678 • Published • 23
Collections
Collections including paper arxiv:2410.02678
- SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
  Paper • 2405.18503 • Published • 9
- DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
  Paper • 2405.20289 • Published • 11
- LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
  Paper • 2406.02897 • Published • 16
- Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
  Paper • 2406.03344 • Published • 21
- Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
  Paper • 2403.02677 • Published • 18
- Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
  Paper • 2403.03003 • Published • 11
- InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
  Paper • 2403.01487 • Published • 16
- VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
  Paper • 2403.00522 • Published • 46
- DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation
  Paper • 2410.00201 • Published
- Does RAG Introduce Unfairness in LLMs? Evaluating Fairness in Retrieval-Augmented Generation Systems
  Paper • 2409.19804 • Published
- Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling
  Paper • 2409.15156 • Published
- Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue
  Paper • 2409.04927 • Published
- iVideoGPT: Interactive VideoGPTs are Scalable World Models
  Paper • 2405.15223 • Published • 17
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 55
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 90
- Matryoshka Multimodal Models
  Paper • 2405.17430 • Published • 34