AI-paper - a shankars Collection

Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

Paper • 2508.09789 • Published Aug 13 • 5

MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

Paper • 2508.13186 • Published Aug 14 • 18

ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents

Paper • 2508.04038 • Published Aug 6 • 1

Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

Paper • 2508.13167 • Published Aug 6 • 127

Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward

Paper • 2508.12800 • Published Aug 18 • 5

Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends

Paper • 2508.11548 • Published Aug 15 • 5

Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge

Paper • 2508.08777 • Published Aug 12 • 15

Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Paper • 2508.09131 • Published Aug 12 • 16

MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers

Paper • 2508.14704 • Published Aug 20 • 42

From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery

Paper • 2508.14111 • Published Aug 18 • 33

RynnEC: Bringing MLLMs into Embodied World

Paper • 2508.14160 • Published Aug 19 • 19

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Paper • 2505.04921 • Published May 8 • 185

Evolving Deeper LLM Thinking

Paper • 2501.09891 • Published Jan 17 • 115

A Survey on Large Language Model Benchmarks

Paper • 2508.15361 • Published Aug 21 • 20

Deep Think with Confidence

Paper • 2508.15260 • Published Aug 21 • 88

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

Paper • 2501.05452 • Published Jan 9 • 15

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Paper • 2504.15279 • Published Apr 21 • 77

Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Paper • 2406.14562 • Published Jun 20, 2024 • 28

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Paper • 2501.06186 • Published Jan 10 • 65

Thinking with Generated Images

Paper • 2505.22525 • Published May 28 • 15

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Paper • 2505.13444 • Published May 19 • 17

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Paper • 2407.01284 • Published Jul 1, 2024 • 82

ComposeAnything: Composite Object Priors for Text-to-Image Generation

Paper • 2505.24086 • Published May 30 • 5

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Paper • 2506.23918 • Published Jun 30 • 88

Visual Planning: Let's Think Only with Images

Paper • 2505.11409 • Published May 16 • 57

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Paper • 2407.07053 • Published Jul 9, 2024 • 47

HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

Paper • 2403.12884 • Published Mar 19, 2024 • 1

CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography

Paper • 2504.10090 • Published Apr 14

Visual Programming: Compositional visual reasoning without training

Paper • 2211.11559 • Published Nov 18, 2022 • 1

ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning

Paper • 2408.02210 • Published Aug 5, 2024 • 9

MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Paper • 2412.18072 • Published Dec 24, 2024 • 20

Intern-S1: A Scientific Multimodal Foundation Model

Paper • 2508.15763 • Published Aug 21 • 256

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

Paper • 2504.06261 • Published Apr 8 • 110

Star Attention: Efficient LLM Inference over Long Sequences

Paper • 2411.17116 • Published Nov 26, 2024 • 55

PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

Paper • 2504.08791 • Published Apr 7 • 137

LLM Inference Unveiled: Survey and Roofline Model Insights

Paper • 2402.16363 • Published Feb 26, 2024 • 4

Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures

Paper • 2504.11750 • Published Apr 16

Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices

Paper • 2410.11795 • Published Oct 15, 2024 • 18

Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions

Paper • 2504.19056 • Published Apr 27 • 18

Personalized Image Generation with Deep Generative Models: A Decade Survey

Paper • 2502.13081 • Published Feb 18

Diffusion Models: A Comprehensive Survey of Methods and Applications

Paper • 2209.00796 • Published Sep 2, 2022

An Empirical Study of GPT-4o Image Generation Capabilities

Paper • 2504.05979 • Published Apr 8 • 64

ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation

Paper • 2502.09411 • Published Feb 13 • 22

A survey of Generative AI Applications

Paper • 2306.02781 • Published Jun 5, 2023

Text-to-image Diffusion Models in Generative AI: A Survey

Paper • 2303.07909 • Published Mar 14, 2023

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

Paper • 2501.06322 • Published Jan 10 • 1

Multi-Agent Collaboration via Evolving Orchestration

Paper • 2505.19591 • Published May 26 • 1

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Paper • 2412.04440 • Published Dec 5, 2024 • 22

AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving

Paper • 2506.12508 • Published Jun 14 • 1

Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

Paper • 2407.07061 • Published Jul 9, 2024 • 27

VideoTetris: Towards Compositional Text-to-Video Generation

Paper • 2406.04277 • Published Jun 6, 2024 • 25

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

Paper • 2407.14505 • Published Jul 19, 2024 • 26

DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

Paper • 2411.16657 • Published Nov 25, 2024 • 20

FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations

Paper • 2411.10818 • Published Nov 16, 2024 • 26

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Paper • 2312.14125 • Published Dec 21, 2023 • 47

PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices

Paper • 2504.03664 • Published Mar 15

FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference

Paper • 2503.03777 • Published Mar 4

SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs

Paper • 2503.16163 • Published Mar 20

HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading

Paper • 2502.12574 • Published Feb 18 • 12

Seesaw: High-throughput LLM Inference via Model Re-sharding

Paper • 2503.06433 • Published Mar 9

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints

Paper • 2504.09345 • Published Apr 12

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14 • 301

MV-RAG: Retrieval Augmented Multiview Diffusion

Paper • 2508.16577 • Published Aug 22 • 38

Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

Paper • 2508.18032 • Published Aug 25 • 41

PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent LLMs

Paper • 2508.17188 • Published Aug 24 • 17

Explain Before You Answer: A Survey on Compositional Visual Reasoning

Paper • 2508.17298 • Published Aug 24 • 4

AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs

Paper • 2508.16153 • Published Aug 22 • 154

AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

Paper • 2508.16279 • Published Aug 22 • 52

CineScale: Free Lunch in High-Resolution Cinematic Visual Generation

Paper • 2508.15774 • Published Aug 21 • 20

Self-Rewarding Vision-Language Model via Reasoning Decomposition

Paper • 2508.19652 • Published Aug 27 • 84

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Paper • 2508.20072 • Published Aug 27 • 31

AudioStory: Generating Long-Form Narrative Audio with Large Language Models

Paper • 2508.20088 • Published Aug 27 • 20

MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment

Paper • 2508.19527 • Published Aug 27 • 10

Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

Paper • 2508.19559 • Published Aug 27 • 6

Mixture of Contexts for Long Video Generation

Paper • 2508.21058 • Published Aug 28 • 35

rStar2-Agent: Agentic Reasoning Technical Report

Paper • 2508.20722 • Published Aug 28 • 115

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Paper • 2508.20751 • Published Aug 28 • 89

AWorld: Orchestrating the Training Recipe for Agentic AI

Paper • 2508.20404 • Published Aug 28 • 38

Dress&Dance: Dress up and Dance as You Like It - Technical Preview

Paper • 2508.21070 • Published Aug 28 • 6

ROSE: Remove Objects with Side Effects in Videos

Paper • 2508.18633 • Published Aug 26 • 7

EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control

Paper • 2508.21112 • Published Aug 28 • 75

A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Paper • 2508.18106 • Published Aug 25 • 344

R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

Paper • 2508.21113 • Published Aug 28 • 109

AHELM: A Holistic Evaluation of Audio-Language Models

Paper • 2508.21376 • Published Aug 29 • 9

Morae: Proactively Pausing UI Agents for User Choices

Paper • 2508.21456 • Published Aug 29 • 5

UItron: Foundational GUI Agent with Advanced Perception and Planning

Paper • 2508.21767 • Published Aug 29 • 12

Efficient Code Embeddings from Code Generation Models

Paper • 2508.21290 • Published Aug 29 • 19

TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training

Paper • 2508.17677 • Published Aug 25 • 14

CLIPSym: Delving into Symmetry Detection with CLIP

Paper • 2508.14197 • Published Aug 19 • 8

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

Paper • 2508.21148 • Published Aug 28 • 139

Continual Learning for Large Language Models: A Survey

Paper • 2402.01364 • Published Feb 2, 2024 • 1

Continual Learning with Pre-Trained Models: A Survey

Paper • 2401.16386 • Published Jan 29, 2024 • 1

Continual Learning: Applications and the Road Forward

Paper • 2311.11908 • Published Nov 20, 2023 • 1

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Paper • 2509.02547 • Published Sep 2 • 224

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Paper • 2509.02479 • Published Sep 2 • 83

ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

Paper • 2508.21496 • Published Aug 29 • 54

VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

Paper • 2509.01055 • Published Sep 1 • 73

POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

Paper • 2509.01215 • Published Sep 1 • 50

GenCompositor: Generative Video Compositing with Diffusion Transformer

Paper • 2509.02460 • Published Sep 2 • 25

OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

Paper • 2509.01644 • Published Sep 1 • 33

Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation

Paper • 2509.00428 • Published Aug 30 • 17

From Editor to Dense Geometry Estimator

Paper • 2509.04338 • Published Sep 4 • 91

Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings

Paper • 2508.18733 • Published Aug 26 • 9

Towards a Unified View of Large Language Model Post-Training

Paper • 2509.04419 • Published Sep 4 • 73

RedStone: Curating General, Code, Math, and QA Data for Large Language Models

Paper • 2412.03398 • Published Dec 4, 2024 • 2

RecAgent: A Novel Simulation Paradigm for Recommender Systems

Paper • 2306.02552 • Published Jun 5, 2023 • 1

Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning

Paper • 2503.11646 • Published Mar 14 • 35

How do language models learn facts? Dynamics, curricula and hallucinations

Paper • 2503.21676 • Published Mar 27 • 1

Investigating Multi-source Active Learning for Natural Language Inference

Paper • 2302.06976 • Published Feb 14, 2023

Targeted Data Acquisition for Evolving Negotiation Agents

Paper • 2106.07728 • Published Jun 14, 2021

UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

Paper • 2509.06155 • Published Sep 7 • 13

Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

Paper • 2509.06949 • Published Sep 8 • 56

Reinforced Visual Perception with Tools

Paper • 2509.01656 • Published Sep 1 • 31

Reinforcement Learning Foundations for Deep Research Systems: A Survey

Paper • 2509.06733 • Published Sep 8 • 32

Visual Representation Alignment for Multimodal Large Language Models

Paper • 2509.07979 • Published Sep 9 • 83

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Paper • 2509.06951 • Published Sep 8 • 31

A Survey of Reinforcement Learning for Large Reasoning Models

Paper • 2509.08827 • Published Sep 10 • 188

EnvX: Agentize Everything with Agentic AI

Paper • 2509.08088 • Published Sep 9 • 8

HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants

Paper • 2509.08494 • Published Sep 10 • 1

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Paper • 2509.09372 • Published Sep 11 • 236

HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Paper • 2509.08519 • Published Sep 10 • 127

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Paper • 2509.09674 • Published Sep 11 • 79

Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Paper • 2509.09595 • Published Sep 11 • 48

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Paper • 2509.09676 • Published Sep 11 • 31

Visual Programmability: A Guide for Code-as-Thought in Chart Understanding

Paper • 2509.09286 • Published Sep 11 • 11

Agentic Software Engineering: Foundational Pillars and a Research Roadmap

Paper • 2509.06216 • Published Sep 7 • 7

AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities

Paper • 2508.11126 • Published Aug 15

Agentic AI Frameworks: Architectures, Protocols, and Design Challenges

Paper • 2508.10146 • Published Aug 13

Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs

Paper • 2509.15020 • Published Sep 18 • 4

Developer-LLM Conversations: An Empirical Study of Interactions and Generated Code Quality

Paper • 2509.10402 • Published Sep 12 • 5

Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

Paper • 2509.15178 • Published Sep 18 • 6

RecoWorld: Building Simulated Environments for Agentic Recommender Systems

Paper • 2509.10397 • Published Sep 12 • 7

MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks

Paper • 2509.14638 • Published Sep 18 • 11

AToken: A Unified Tokenizer for Vision

Paper • 2509.14476 • Published Sep 17 • 36

FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

Paper • 2509.13160 • Published Sep 16 • 29

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

Paper • 2509.15185 • Published Sep 18 • 29

Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Paper • 2509.15194 • Published Sep 18 • 33

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Paper • 2509.15221 • Published Sep 18 • 109

FlowRL: Matching Reward Distributions for LLM Reasoning

Paper • 2509.15207 • Published Sep 18 • 113

Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation

Paper • 2509.14760 • Published Sep 18 • 52

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Paper • 2509.16197 • Published Sep 19 • 54

Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

Paper • 2509.15591 • Published Sep 19 • 45

Lynx: Towards High-Fidelity Personalized Video Generation

Paper • 2509.15496 • Published Sep 19 • 12

LIMI: Less is More for Agency

Paper • 2509.17567 • Published Sep 22 • 100

OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Paper • 2509.17627 • Published Sep 22 • 66

Qwen3-Omni Technical Report

Paper • 2509.17765 • Published Sep 22 • 135

OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System

Paper • 2509.18091 • Published Sep 22 • 33

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Paper • 2509.18056 • Published Sep 22 • 27

GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning

Paper • 2509.17437 • Published Sep 22 • 17

EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

Paper • 2509.17396 • Published Sep 22 • 19

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Paper • 2509.16941 • Published Sep 21 • 21

FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

Paper • 2509.17177 • Published Sep 21 • 13

Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels

Paper • 2509.16596 • Published Sep 20 • 14

Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning

Paper • 2509.18083 • Published Sep 22 • 5

Understanding Embedding Scaling in Collaborative Filtering

Paper • 2509.15709 • Published Sep 19 • 5

ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment

Paper • 2509.17818 • Published Sep 22 • 8

AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

Paper • 2509.17641 • Published Sep 22 • 4

DIWALI - Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context

Paper • 2509.17399 • Published Sep 22 • 2

When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs

Paper • 2509.16633 • Published Sep 20 • 2

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Paper • 2509.18154 • Published Sep 16 • 50

Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation

Paper • 2509.18824 • Published Sep 23 • 22

What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT

Paper • 2509.19284 • Published Sep 23 • 22

VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Paper • 2509.19002 • Published Sep 23 • 2

Video models are zero-shot learners and reasoners

Paper • 2509.20328 • Published Sep 24 • 96

SIM-CoT: Supervised Implicit Chain-of-Thought

Paper • 2509.20317 • Published Sep 24 • 41

EmbeddingGemma: Powerful and Lightweight Text Representations

Paper • 2509.20354 • Published Sep 24 • 39

EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Paper • 2509.20360 • Published Sep 24 • 17

PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation

Paper • 2509.20358 • Published Sep 24 • 14

Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

Paper • 2509.19244 • Published Sep 23 • 11

Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say

Paper • 2509.21164 • Published Sep 25 • 8

VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

Paper • 2509.19803 • Published Sep 24 • 118

SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

Paper • 2509.21320 • Published Sep 25 • 99

MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

Paper • 2509.21268 • Published Sep 25 • 101

Tree Search for LLM Agent Reinforcement Learning

Paper • 2509.21240 • Published Sep 25 • 87

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Paper • 2509.20427 • Published Sep 24 • 77

AutoIntent: AutoML for Text Classification

Paper • 2509.21138 • Published Sep 25 • 35

TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

Paper • 2509.21117 • Published Sep 25 • 29

Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution

Paper • 2509.21072 • Published Sep 25 • 15

Does FLUX Already Know How to Perform Physically Plausible Image Composition?

Paper • 2509.21278 • Published Sep 25 • 15

Thinking Augmented Pre-training

Paper • 2509.20186 • Published Sep 24 • 23

Understanding the Thinking Process of Reasoning Models: A Perspective from Schoenfeld's Episode Theory

Paper • 2509.14662 • Published Sep 18 • 13

SD3.5-Flash: Distribution-Guided Distillation of Generative Flows

Paper • 2509.21318 • Published Sep 25 • 10

Interactive Recommendation Agent with Active User Commands

Paper • 2509.21317 • Published Sep 25 • 6

UserRL: Training Interactive User-Centric Agent via Reinforcement Learning

Paper • 2509.19736 • Published Sep 24 • 11

MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning

Paper • 2509.21113 • Published Sep 25 • 5

SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent

Paper • 2509.20414 • Published Sep 24 • 9

Thinking While Listening: Simple Test Time Scaling For Audio Classification

Paper • 2509.19676 • Published Sep 24 • 4

When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

Paper • 2509.20293 • Published Sep 24 • 7

Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

Paper • 2509.20109 • Published Sep 24 • 3

CompLLM: Compression for Long Context Q&A

Paper • 2509.19228 • Published Sep 23 • 8

Blueprints of Trust: AI System Cards for End to End Transparency and Governance

Paper • 2509.20394 • Published Sep 23 • 2

StyleBench: Evaluating thinking styles in Large Language Models

Paper • 2509.20868 • Published Sep 25 • 3

OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps

Paper • 2509.19282 • Published Sep 23 • 7

LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

Paper • 2509.22414 • Published Sep 26 • 21

UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models

Paper • 2509.21760 • Published Sep 26 • 14

VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing

Paper • 2509.22651 • Published Sep 26 • 22

Variational Reasoning for Language Models

Paper • 2509.22637 • Published Sep 26 • 68

LongLive: Real-time Interactive Long Video Generation

Paper • 2509.22622 • Published Sep 26 • 182

A Survey of Interactive Generative Video

Paper • 2504.21853 • Published Apr 30 • 46

Evaluating Very Long-Term Conversational Memory of LLM Agents

Paper • 2402.17753 • Published Feb 27, 2024 • 20

VBench: Comprehensive Benchmark Suite for Video Generative Models

Paper • 2311.17982 • Published Nov 29, 2023 • 9

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

Paper • 2503.21755 • Published Mar 27 • 33

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

Paper • 2411.13503 • Published Nov 20, 2024 • 34

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Paper • 2406.16855 • Published Jun 24, 2024 • 57

VCBench: Benchmarking LLMs in Venture Capital

Paper • 2509.14448 • Published Sep 17

AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection

Paper • 2504.20865 • Published Apr 29

ConsumerBench: Benchmarking Generative AI Applications on End-User Devices

Paper • 2506.17538 • Published Jun 21 • 7

Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol

Paper • 2503.05860 • Published Mar 7 • 11

MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks

Paper • 2507.12284 • Published Jul 16 • 1

Benchmarking Neural Network Training Algorithms

Paper • 2306.07179 • Published Jun 12, 2023 • 23

SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

Paper • 2406.14991 • Published Jun 21, 2024 • 2

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

Paper • 2506.00482 • Published May 31 • 8

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Paper • 2406.15877 • Published Jun 22, 2024 • 48

MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

Paper • 2407.18961 • Published Jul 18, 2024 • 40

ImgEdit: A Unified Image Editing Dataset and Benchmark

Paper • 2505.20275 • Published May 26 • 18

GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

Paper • 2504.02782 • Published Apr 3 • 57

7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models

Paper • 2508.12919 • Published Aug 18 • 1

Instruction-Following Evaluation in Function Calling for Large Language Models

Paper • 2509.18420 • Published Sep 22 • 1

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Paper • 2509.22186 • Published Sep 26 • 132

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

Paper • 2509.22220 • Published Sep 26 • 64

RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

Paper • 2509.24897 • Published Sep 29 • 46

OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

Paper • 2509.24900 • Published Sep 29 • 53

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Paper • 2505.09568 • Published May 14 • 98

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

Paper • 2505.02567 • Published May 5 • 80

Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models

Paper • 2406.12644 • Published Jun 18, 2024 • 5

ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies

Paper • 2506.12830 • Published Jun 15

CompBench: Benchmarking Complex Instruction-guided Image Editing

Paper • 2505.12200 • Published May 18

Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination

Paper • 2509.01986 • Published Sep 2 • 4

GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

Paper • 2310.11513 • Published Oct 17, 2023 • 1

Visual Jigsaw Post-Training Improves MLLMs

Paper • 2509.25190 • Published Sep 29 • 35

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Paper • 2509.24695 • Published Sep 29 • 44

Democratizing AI scientists using ToolUniverse

Paper • 2509.23426 • Published Sep 27 • 39

EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering

Paper • 2509.25175 • Published Sep 29 • 29

Towards Personalized Deep Research: Benchmarks and Evaluations

Paper • 2509.25106 • Published Sep 29 • 28

VideoScore2: Think before You Score in Generative Video Evaluation

Paper • 2509.22799 • Published Sep 26 • 24

MMPB: It's Time for Multi-Modal Personalization

Paper • 2509.22820 • Published Sep 26 • 14

Personalization of Large Language Models: A Survey

Paper • 2411.00027 • Published Oct 29, 2024 • 33

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Paper • 2509.25161 • Published Sep 29 • 23

HunyuanImage 3.0 Technical Report

Paper • 2509.23951 • Published Sep 28 • 21

PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images

Paper • 2509.25185 • Published Sep 29 • 4

Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification

Paper • 2509.23061 • Published Sep 27 • 6

UniVid: The Open-Source Unified Video Model

Paper • 2509.24200 • Published Sep 29 • 4

PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation

Paper • 2509.23338 • Published Sep 27 • 4

BPMN Assistant: An LLM-Based Approach to Business Process Modeling

Paper • 2509.24592 • Published Sep 29 • 1

Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models

Paper • 2509.23233 • Published Sep 27 • 2

Advancing Reference-free Evaluation of Video Captions with Factual Analysis

Paper • 2509.16538 • Published Sep 20

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Paper • 2509.24002 • Published Sep 28 • 171

OceanGym: A Benchmark Environment for Underwater Embodied Agents

Paper • 2509.26536 • Published Sep 30 • 34

DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder

Paper • 2509.25182 • Published Sep 29 • 36

Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training

Paper • 2509.26625 • Published Sep 30 • 43

VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

Paper • 2509.26490 • Published Sep 30 • 19

dParallel: Learnable Parallel Decoding for dLLMs

Paper • 2509.26488 • Published Sep 30 • 19

DA^2: Depth Anything in Any Direction

Paper • 2509.26618 • Published Sep 30 • 25

TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

Paper • 2509.26329 • Published Sep 30 • 2

Video Object Segmentation-Aware Audio Generation

Paper • 2509.26604 • Published Sep 30 • 1

BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software

Paper • 2509.25248 • Published Sep 27 • 2

Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation

Paper • 2509.26555 • Published Sep 30

Regression Language Models for Code

Paper • 2509.26476 • Published Sep 30 • 16

The Pitfalls of KV Cache Compression

Paper • 2510.00231 • Published Sep 30 • 5

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Paper • 2509.26539 • Published Sep 30 • 8

LayerD: Decomposing Raster Graphic Designs into Layers

Paper • 2509.25134 • Published Sep 29 • 1

Improving Editability in Image Generation with Layer-wise Memory

Paper • 2505.01079 • Published May 2 • 29

Generative Image Layer Decomposition with Visual Effects

Paper • 2411.17864 • Published Nov 26, 2024

Edit Transfer: Learning Image Editing via Vision In-Context Relations

Paper • 2503.13327 • Published Mar 17 • 29

Text2Layer: Layered Image Generation using Latent Diffusion Model

Paper • 2307.09781 • Published Jul 19, 2023 • 15

Code2Video: A Code-centric Paradigm for Educational Video Generation

Paper • 2510.01174 • Published Oct 1 • 33

GEM: A Gym for Agentic LLMs

Paper • 2510.01051 • Published Oct 1 • 88

BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses

Paper • 2510.00232 • Published Sep 30 • 15

In-Place Feedback: A New Paradigm for Guiding LLMs in Multi-Turn Reasoning

Paper • 2510.00777 • Published Oct 1 • 2

An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

Paper • 2509.19185 • Published Sep 23 • 3

Can Large Multimodal Models Uncover Deep Semantics Behind Images?

Paper • 2402.11281 • Published Feb 17, 2024 • 1

Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models

Paper • 2509.25162 • Published Sep 29 • 3

BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

Paper • 2510.00438 • Published Oct 1 • 7

BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs

Paper • 2509.26514 • Published Sep 30 • 3

Eliciting Secret Knowledge from Language Models

Paper • 2510.01070 • Published Oct 1 • 4

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Paper • 2510.02283 • Published Oct 2 • 92

StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?

Paper • 2510.02209 • Published Oct 2 • 52

BloombergGPT: A Large Language Model for Finance

Paper • 2303.17564 • Published Mar 30, 2023 • 27

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Paper • 2510.01284 • Published Sep 30 • 32

A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports

Paper • 2510.02190 • Published Oct 2 • 18

VIRTUE: Visual-Interactive Text-Image Universal Embedder

Paper • 2510.00523 • Published Oct 1 • 6

Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

Paper • 2504.17432 • Published Apr 24 • 39

LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation

Paper • 2411.04997 • Published Nov 7, 2024 • 39

Veagle: Advancements in Multimodal Representation Learning

Paper • 2403.08773 • Published Jan 18, 2024 • 10

CoDA: Agentic Systems for Collaborative Data Visualization

Paper • 2510.03194 • Published Oct 3 • 28

SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

Paper • 2510.03120 • Published Oct 3 • 6

Paper2Video: Automatic Video Generation from Scientific Papers

Paper • 2510.05096 • Published Oct 6 • 111

VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

Paper • 2510.05094 • Published Oct 6 • 36

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Paper • 2510.04618 • Published Oct 6 • 120

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

Paper • 2510.04800 • Published Oct 6 • 36

Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Paper • 2510.03215 • Published Oct 3 • 96

Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Paper • 2510.06590 • Published Oct 8 • 70

Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

Paper • 2510.06308 • Published Oct 7 • 53

SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

Paper • 2510.06917 • Published Oct 8 • 34

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

Paper • 2510.08540 • Published Oct 9 • 108

MATRIX: Mask Track Alignment for Interaction-aware Video Generation

Paper • 2510.07310 • Published Oct 8 • 35

RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

Paper • 2510.06710 • Published Oct 8 • 38

Vibe Checker: Aligning Code Evaluation with Human Preference

Paper • 2510.07315 • Published Oct 8 • 32

Online Generic Event Boundary Detection

Paper • 2510.06855 • Published Oct 8 • 3

Bridging Text and Video Generation: A Survey

Paper • 2510.04999 • Published Oct 6 • 3

U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking

Paper • 2510.07041 • Published Oct 8 • 3

DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents

Paper • 2509.21842 • Published Sep 26 • 2

Agent Learning via Early Experience

Paper • 2510.08558 • Published Oct 9 • 262

UniVideo: Unified Understanding, Generation, and Editing for Videos

Paper • 2510.08377 • Published Oct 9 • 70

UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution

Paper • 2510.08143 • Published Oct 9 • 20

UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG

Paper • 2510.03663 • Published Oct 4 • 15

NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

Paper • 2510.07172 • Published Oct 8 • 28

VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

Paper • 2510.08555 • Published Oct 9 • 62

Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training

Paper • 2510.08008 • Published Oct 9 • 5

Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs

Paper • 2510.07429 • Published Oct 8 • 3

Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window

Paper • 2510.08276 • Published Oct 9 • 9

SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

Paper • 2510.08559 • Published Oct 9 • 8

Character Mixing for Video Generation

Paper • 2510.05093 • Published Oct 6 • 6

WithAnyone: Towards Controllable and ID Consistent Image Generation

Paper • 2510.14975 • Published Oct 16 • 80

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

Paper • 2510.14979 • Published Oct 16 • 65

Attention Is All You Need for KV Cache in Diffusion LLMs

Paper • 2510.14973 • Published Oct 16 • 38

LLM-guided Hierarchical Retrieval

Paper • 2510.13217 • Published Oct 15 • 16

Qwen3Guard Technical Report

Paper • 2510.14276 • Published Oct 16 • 13

Learning an Image Editing Model without Image Editing Pairs

Paper • 2510.14978 • Published Oct 16 • 7

pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation

Paper • 2510.14974 • Published Oct 16 • 9

RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

Paper • 2510.13910 • Published Oct 15 • 1

DeepAgent: A General Reasoning Agent with Scalable Toolsets

Paper • 2510.21618 • Published 24 days ago • 95

Video-As-Prompt: Unified Semantic Control for Video Generation

Paper • 2510.20888 • Published 25 days ago • 44

UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

Paper • 2510.20286 • Published 26 days ago • 23

From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model

Paper • 2510.19871 • Published 27 days ago • 29

RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging

Paper • 2510.20479 • Published 26 days ago • 10

Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

Paper • 2510.13251 • Published Oct 15 • 12

Model Merging with Functional Dual Anchors

Paper • 2510.21223 • Published 25 days ago • 12

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

Paper • 2510.20206 • Published 26 days ago • 11

A Definition of AGI

Paper • 2510.18212 • Published 28 days ago • 33

Visual Diffusion Models are Geometric Solvers

Paper • 2510.21697 • Published 24 days ago • 18

AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Paper • 2510.21652 • Published 24 days ago • 3

ARC-Encoder: learning compressed text representations for large language models

Paper • 2510.20535 • Published 26 days ago • 5

Taming Modality Entanglement in Continual Audio-Visual Segmentation

Paper • 2510.17234 • Published 29 days ago • 3

MemOS: A Memory OS for AI System

Paper • 2507.03724 • Published Jul 4 • 155

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Paper • 2307.16789 • Published Jul 31, 2023 • 101

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Paper • 2304.08244 • Published Apr 14, 2023 • 1

ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

Paper • 2501.02506 • Published Jan 5 • 11

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents

Paper • 2207.01206 • Published Jul 4, 2022 • 3

GAIA: a benchmark for General AI Assistants

Paper • 2311.12983 • Published Nov 21, 2023 • 241

Task Vectors are Cross-Modal

Paper • 2410.22330 • Published Oct 29, 2024 • 11

In-Context Learning Creates Task Vectors

Paper • 2310.15916 • Published Oct 24, 2023 • 43

Group Relative Attention Guidance for Image Editing

Paper • 2510.24657 • Published 20 days ago • 23

OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents

Paper • 2510.24563 • Published 20 days ago • 22

WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking

Paper • 2510.24697 • Published 20 days ago • 20

BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions

Paper • 2510.10666 • Published Oct 12 • 27

WideSearch: Benchmarking Agentic Broad Info-Seeking

Paper • 2508.07999 • Published Aug 11 • 109

SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Paper • 2506.01062 • Published Jun 1 • 5

Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

Paper • 2510.24711 • Published 20 days ago • 18

VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Paper • 2510.22373 • Published 23 days ago • 14

PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding

Paper • 2510.22264 • Published 24 days ago • 1

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Paper • 2510.26802 • Published 18 days ago • 32

AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

Paper • 2510.26768 • Published 18 days ago • 33

The Era of Agentic Organization: Learning to Organize with Language Models

Paper • 2510.26658 • Published 18 days ago • 25

OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation

Paper • 2510.26213 • Published 19 days ago • 9

Magentic Marketplace: An Open-Source Environment for Studying Agentic Markets

Paper • 2510.25779 • Published 21 days ago • 9

CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Paper • 2510.26160 • Published 19 days ago • 15

ChartAB: A Benchmark for Chart Grounding & Dense Alignment

Paper • 2510.26781 • Published 18 days ago

Emu3.5: Native Multimodal Models are World Learners

Paper • 2510.26583 • Published 18 days ago • 103

The End of Manual Decoding: Towards Truly End-to-End Language Models

Paper • 2510.26697 • Published 18 days ago • 113

Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

Paper • 2510.23473 • Published 21 days ago • 83

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

Paper • 2510.23538 • Published 21 days ago • 95

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Paper • 2510.25726 • Published 19 days ago • 44

VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

Paper • 2510.25772 • Published 19 days ago • 32

The Principles of Diffusion Models

Paper • 2510.21890 • Published 25 days ago • 56

RegionE: Adaptive Region-Aware Generation for Efficient Image Editing

Paper • 2510.25590 • Published 19 days ago • 25

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Paper • 2510.25760 • Published 19 days ago • 16

SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs

Paper • 2510.25092 • Published 20 days ago • 7

Reasoning Language Model Inference Serving Unveiled: An Empirical Study

Paper • 2510.18672 • Published 28 days ago • 7

InteractComp: Evaluating Search Agents With Ambiguous Queries

Paper • 2510.24668 • Published 20 days ago • 96

INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

Paper • 2510.25602 • Published 19 days ago • 69

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Paper • 2510.27492 • Published 18 days ago • 79

Defeating the Training-Inference Mismatch via FP16

Paper • 2510.26788 • Published 18 days ago • 27

Revisiting Multimodal Positional Encoding in Vision-Language Models

Paper • 2510.23095 • Published 22 days ago • 20

Higher-order Linear Attention

Paper • 2510.27258 • Published 18 days ago • 11

The Denario project: Deep knowledge AI agents for scientific discovery

Paper • 2510.26887 • Published 18 days ago • 6

UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback

Paper • 2511.01678 • Published 14 days ago • 33

Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

Paper • 2510.22115 • Published 24 days ago • 81

The Underappreciated Power of Vision Models for Graph Structural Understanding

Paper • 2510.24788 • Published 22 days ago • 35

UniREditBench: A Unified Reasoning-based Image Editing Benchmark

Paper • 2511.01295 • Published 15 days ago • 37

ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

Paper • 2510.27363 • Published 18 days ago • 22

ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

Paper • 2511.01163 • Published 15 days ago • 31

Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

Paper • 2510.27571 • Published 17 days ago • 17

TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

Paper • 2511.01833 • Published 14 days ago • 15

LongCat-Flash-Omni Technical Report

Paper • 2511.00279 • Published 17 days ago • 21

Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

Paper • 2510.26865 • Published 18 days ago • 11

Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

Paper • 2511.01618 • Published 15 days ago • 9

Trove: A Flexible Toolkit for Dense Retrieval

Paper • 2511.01857 • Published 14 days ago • 10

Towards Robust Mathematical Reasoning

Paper • 2511.01846 • Published 14 days ago • 7

MotionStream: Real-Time Video Generation with Interactive Motion Controls

Paper • 2511.01266 • Published 15 days ago • 26

UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

Paper • 2511.00405 • Published 17 days ago • 5

Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

Paper • 2511.01617 • Published 15 days ago • 2

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Paper • 2511.02778 • Published 13 days ago • 100

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

Paper • 2511.02779 • Published 13 days ago • 53

LTD-Bench: Evaluating Large Language Models by Letting Them Draw

Paper • 2511.02347 • Published 14 days ago • 8

TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System

Paper • 2511.02832 • Published 13 days ago • 8

Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

Paper • 2511.02650 • Published 13 days ago • 9

CodeClash: Benchmarking Goal-Oriented Software Engineering

Paper • 2511.00839 • Published 16 days ago • 8

iFlyBot-VLA Technical Report

Paper • 2511.01914 • Published 17 days ago • 5

TabDSR: Decompose, Sanitize, and Reason for Complex Numerical Reasoning in Tabular Data

Paper • 2511.02219 • Published 14 days ago • 1

LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context

Paper • 2511.02366 • Published 14 days ago • 2

VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

Paper • 2511.02712 • Published 13 days ago • 2

MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Paper • 2511.03146 • Published 13 days ago • 7

TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models

Paper • 2511.02802 • Published 13 days ago • 13

V-Thinker: Interactive Thinking with Images

Paper • 2511.04460 • Published 11 days ago • 94

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Paper • 2511.04570 • Published 11 days ago • 189

Scaling Agent Learning via Experience Synthesis

Paper • 2511.03773 • Published 12 days ago • 75

NVIDIA Nemotron Nano V2 VL

Paper • 2511.03929 • Published 12 days ago • 26

GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents

Paper • 2511.04307 • Published 12 days ago • 14

Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

Paper • 2511.04655 • Published 11 days ago • 7

Diffusion Language Models are Super Data Learners

Paper • 2511.03276 • Published 13 days ago • 114

A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures

Paper • 2506.19676 • Published Jun 24

MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools

Paper • 2509.09734 • Published Sep 10 • 15

DeepEyesV2: Toward Agentic Multimodal Model

Paper • 2511.05271 • Published 11 days ago • 38

Visual Spatial Tuning

Paper • 2511.05491 • Published 10 days ago • 46

Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

Paper • 2511.04962 • Published 11 days ago • 50

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

Paper • 2511.05017 • Published 11 days ago • 7

Dense Motion Captioning

Paper • 2511.05369 • Published 10 days ago • 9

Real-Time Reasoning Agents in Evolving Environments

Paper • 2511.04898 • Published 11 days ago • 11