laion/freesound-commercially-permissive-subset-with-captions Viewer • Updated 32 minutes ago • 397k • 33 • 2
ClimateGAN: Raising Climate Change Awareness by Generating Images of Floods Paper • 2110.02871 • Published Oct 6, 2021
MuPT: A Generative Symbolic Music Pretrained Transformer Paper • 2404.06393 • Published Apr 9, 2024 • 16
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation Paper • 2211.06687 • Published Nov 12, 2022 • 4
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks Paper • 2412.04626 • Published Dec 5, 2024 • 14
A Single Merging Suffices: Recovering Server-based Learning Performance in Decentralized Learning Paper • 2507.06542 • Published Jul 9
MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation Paper • 2406.07529 • Published Jun 11, 2024
Improving GUI Grounding with Explicit Position-to-Coordinate Mapping Paper • 2510.03230 • Published Oct 3 • 3
Chronological Thinking in Full-Duplex Spoken Dialogue Language Models Paper • 2510.05150 • Published Oct 2
Scope: Selective Cross-modal Orchestration of Visual Perception Experts Paper • 2510.12974 • Published 25 days ago
InteractComp: Evaluating Search Agents With Ambiguous Queries Paper • 2510.24668 • Published 11 days ago • 96
Post 3171 Trained a model for emotion-controllable TTS based on MiMo audio on LAION's dataset. Still very early and does have an issue with hallucinating, but results seem pretty good so far, given that it is very early into the training run. Will probably kick off a new run later with some settings tweaked. Put up a demo here: mrfakename/EmoAct-MiMo (Turn 🔊 on to hear audio samples)
VeritasFi: An Adaptable, Multi-tiered RAG Framework for Multi-modal Financial Question Answering Paper • 2510.10828 • Published 27 days ago • 1
MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources Paper • 2509.25531 • Published Sep 29 • 7