shankars's Collections
AI-paper
Describe What You See with Multimodal Large Language Models to Enhance
Video Recommendations
Paper
•
2508.09789
•
Published
•
5
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper
•
2508.13186
•
Published
•
18
ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval
Driven LLM Agents
Paper
•
2508.04038
•
Published
•
1
Prompt Orchestration Markup Language
Paper
•
2508.13948
•
Published
•
48
MultiRef: Controllable Image Generation with Multiple Visual References
Paper
•
2508.06905
•
Published
•
21
LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
Paper
•
2508.14041
•
Published
•
59
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent
Distillation and Agentic RL
Paper
•
2508.13167
•
Published
•
127
Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic
Thought Reward
Paper
•
2508.12800
•
Published
•
5
Copyright Protection for Large Language Models: A Survey of Methods,
Challenges, and Trends
Paper
•
2508.11548
•
Published
•
5
Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
Paper
•
2508.08777
•
Published
•
15
Training-Free Text-Guided Color Editing with Multi-Modal Diffusion
Transformer
Paper
•
2508.09131
•
Published
•
16
MCP-Universe: Benchmarking Large Language Models with Real-World Model
Context Protocol Servers
Paper
•
2508.14704
•
Published
•
42
From AI for Science to Agentic Science: A Survey on Autonomous
Scientific Discovery
Paper
•
2508.14111
•
Published
•
33
RynnEC: Bringing MLLMs into Embodied World
Paper
•
2508.14160
•
Published
•
19
Perception, Reason, Think, and Plan: A Survey on Large Multimodal
Reasoning Models
Paper
•
2505.04921
•
Published
•
185
Evolving Deeper LLM Thinking
Paper
•
2501.09891
•
Published
•
115
A Survey on Large Language Model Benchmarks
Paper
•
2508.15361
•
Published
•
20
Deep Think with Confidence
Paper
•
2508.15260
•
Published
•
88
ReFocus: Visual Editing as a Chain of Thought for Structured Image
Understanding
Paper
•
2501.05452
•
Published
•
15
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal
Large Language Models
Paper
•
2504.15279
•
Published
•
77
Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
Paper
•
2406.14562
•
Published
•
28
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper
•
2501.06186
•
Published
•
65
Thinking with Generated Images
Paper
•
2505.22525
•
Published
•
15
ChartMuseum: Testing Visual Reasoning Capabilities of Large
Vision-Language Models
Paper
•
2505.13444
•
Published
•
17
We-Math: Does Your Large Multimodal Model Achieve Human-like
Mathematical Reasoning?
Paper
•
2407.01284
•
Published
•
82
ComposeAnything: Composite Object Priors for Text-to-Image Generation
Paper
•
2505.24086
•
Published
•
5
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and
Future Frontiers
Paper
•
2506.23918
•
Published
•
88
Visual Planning: Let's Think Only with Images
Paper
•
2505.11409
•
Published
•
57
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
Instruction Using Language Model
Paper
•
2407.07053
•
Published
•
47
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Paper
•
2403.12884
•
Published
•
1
CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography
Paper
•
2504.10090
•
Published
Visual Programming: Compositional visual reasoning without training
Paper
•
2211.11559
•
Published
•
1
ExoViP: Step-by-step Verification and Exploration with Exoskeleton
Modules for Compositional Visual Reasoning
Paper
•
2408.02210
•
Published
•
9
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Paper
•
2412.18072
•
Published
•
20
Intern-S1: A Scientific Multimodal Foundation Model
Paper
•
2508.15763
•
Published
•
256
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Paper
•
2504.06261
•
Published
•
110
Star Attention: Efficient LLM Inference over Long Sequences
Paper
•
2411.17116
•
Published
•
55
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday
Home Clusters
Paper
•
2504.08791
•
Published
•
137
LLM Inference Unveiled: Survey and Roofline Model Insights
Paper
•
2402.16363
•
Published
•
4
Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled
Architectures
Paper
•
2504.11750
•
Published
Efficient Diffusion Models: A Comprehensive Survey from Principles to
Practices
Paper
•
2410.11795
•
Published
•
18
Generative AI for Character Animation: A Comprehensive Survey of
Techniques, Applications, and Future Directions
Paper
•
2504.19056
•
Published
•
18
Personalized Image Generation with Deep Generative Models: A Decade
Survey
Paper
•
2502.13081
•
Published
Diffusion Models: A Comprehensive Survey of Methods and Applications
Paper
•
2209.00796
•
Published
An Empirical Study of GPT-4o Image Generation Capabilities
Paper
•
2504.05979
•
Published
•
64
ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation
Paper
•
2502.09411
•
Published
•
22
A survey of Generative AI Applications
Paper
•
2306.02781
•
Published
Text-to-image Diffusion Models in Generative AI: A Survey
Paper
•
2303.07909
•
Published
Multi-Agent Collaboration Mechanisms: A Survey of LLMs
Paper
•
2501.06322
•
Published
•
1
Multi-Agent Collaboration via Evolving Orchestration
Paper
•
2505.19591
•
Published
•
1
GenMAC: Compositional Text-to-Video Generation with Multi-Agent
Collaboration
Paper
•
2412.04440
•
Published
•
22
AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose
Task Solving
Paper
•
2506.12508
•
Published
•
1
Internet of Agents: Weaving a Web of Heterogeneous Agents for
Collaborative Intelligence
Paper
•
2407.07061
•
Published
•
27
VideoTetris: Towards Compositional Text-to-Video Generation
Paper
•
2406.04277
•
Published
•
25
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video
Generation
Paper
•
2407.14505
•
Published
•
26
DreamRunner: Fine-Grained Storytelling Video Generation with
Retrieval-Augmented Motion Adaptation
Paper
•
2411.16657
•
Published
•
20
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Paper
•
2411.10818
•
Published
•
26
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Paper
•
2312.14125
•
Published
•
47
PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices
Paper
•
2504.03664
•
Published
FlexInfer: Breaking Memory Constraint via Flexible and Efficient
Offloading for On-Device LLM Inference
Paper
•
2503.03777
•
Published
SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs
Paper
•
2503.16163
•
Published
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
Paper
•
2502.12574
•
Published
•
12
Seesaw: High-throughput LLM Inference via Model Re-sharding
Paper
•
2503.06433
•
Published
MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving
Under Resource Constraints
Paper
•
2504.09345
•
Published
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
•
2504.10479
•
Published
•
301
MV-RAG: Retrieval Augmented Multiview Diffusion
Paper
•
2508.16577
•
Published
•
38
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance
for Text-to-Image Generation
Paper
•
2508.18032
•
Published
•
41
PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent
LLMs
Paper
•
2508.17188
•
Published
•
17
Explain Before You Answer: A Survey on Compositional Visual Reasoning
Paper
•
2508.17298
•
Published
•
4
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
Paper
•
2508.16153
•
Published
•
154
AgentScope 1.0: A Developer-Centric Framework for Building Agentic
Applications
Paper
•
2508.16279
•
Published
•
52
CineScale: Free Lunch in High-Resolution Cinematic Visual Generation
Paper
•
2508.15774
•
Published
•
20
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Paper
•
2508.19652
•
Published
•
84
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding
in Vision-Language-Action Policies
Paper
•
2508.20072
•
Published
•
31
AudioStory: Generating Long-Form Narrative Audio with Large Language
Models
Paper
•
2508.20088
•
Published
•
20
MotionFlux: Efficient Text-Guided Motion Generation through Rectified
Flow Matching and Preference Alignment
Paper
•
2508.19527
•
Published
•
10
Taming the Chaos: Coordinated Autoscaling for Heterogeneous and
Disaggregated LLM Inference
Paper
•
2508.19559
•
Published
•
6
Mixture of Contexts for Long Video Generation
Paper
•
2508.21058
•
Published
•
35
rStar2-Agent: Agentic Reasoning Technical Report
Paper
•
2508.20722
•
Published
•
115
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable
Text-to-Image Reinforcement Learning
Paper
•
2508.20751
•
Published
•
89
AWorld: Orchestrating the Training Recipe for Agentic AI
Paper
•
2508.20404
•
Published
•
38
Dress&Dance: Dress up and Dance as You Like It - Technical Preview
Paper
•
2508.21070
•
Published
•
6
ROSE: Remove Objects with Side Effects in Videos
Paper
•
2508.18633
•
Published
•
7
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for
General Robot Control
Paper
•
2508.21112
•
Published
•
75
A.S.E: A Repository-Level Benchmark for Evaluating Security in
AI-Generated Code
Paper
•
2508.18106
•
Published
•
344
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs
via Bi-Mode Annealing and Reinforce Learning
Paper
•
2508.21113
•
Published
•
109
AHELM: A Holistic Evaluation of Audio-Language Models
Paper
•
2508.21376
•
Published
•
9
Morae: Proactively Pausing UI Agents for User Choices
Paper
•
2508.21456
•
Published
•
5
UItron: Foundational GUI Agent with Advanced Perception and Planning
Paper
•
2508.21767
•
Published
•
12
Efficient Code Embeddings from Code Generation Models
Paper
•
2508.21290
•
Published
•
19
TiKMiX: Take Data Influence into Dynamic Mixture for Language Model
Pre-training
Paper
•
2508.17677
•
Published
•
14
CLIPSym: Delving into Symmetry Detection with CLIP
Paper
•
2508.14197
•
Published
•
8
A Survey of Scientific Large Language Models: From Data Foundations to
Agent Frontiers
Paper
•
2508.21148
•
Published
•
139
Continual Learning for Large Language Models: A Survey
Paper
•
2402.01364
•
Published
•
1
Continual Learning with Pre-Trained Models: A Survey
Paper
•
2401.16386
•
Published
•
1
Continual Learning: Applications and the Road Forward
Paper
•
2311.11908
•
Published
•
1
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Paper
•
2509.02547
•
Published
•
224
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn
Tool-Integrated Reasoning
Paper
•
2509.02479
•
Published
•
83
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long
Video Understanding
Paper
•
2508.21496
•
Published
•
54
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
Paper
•
2509.01055
•
Published
•
73
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models
for Document Conversion
Paper
•
2509.01215
•
Published
•
50
GenCompositor: Generative Video Compositing with Diffusion Transformer
Paper
•
2509.02460
•
Published
•
25
OpenVision 2: A Family of Generative Pretrained Visual Encoders for
Multimodal Learning
Paper
•
2509.01644
•
Published
•
33
Mixture of Global and Local Experts with Diffusion Transformer for
Controllable Face Generation
Paper
•
2509.00428
•
Published
•
17
From Editor to Dense Geometry Estimator
Paper
•
2509.04338
•
Published
•
91
Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from
Vector Drawings
Paper
•
2508.18733
•
Published
•
9
Towards a Unified View of Large Language Model Post-Training
Paper
•
2509.04419
•
Published
•
73
RedStone: Curating General, Code, Math, and QA Data for Large Language
Models
Paper
•
2412.03398
•
Published
•
2
RecAgent: A Novel Simulation Paradigm for Recommender Systems
Paper
•
2306.02552
•
Published
•
1
Adversarial Data Collection: Human-Collaborative Perturbations for
Efficient and Robust Robotic Imitation Learning
Paper
•
2503.11646
•
Published
•
35
How do language models learn facts? Dynamics, curricula and
hallucinations
Paper
•
2503.21676
•
Published
•
1
Investigating Multi-source Active Learning for Natural Language
Inference
Paper
•
2302.06976
•
Published
Targeted Data Acquisition for Evolving Negotiation Agents
Paper
•
2106.07728
•
Published
UniVerse-1: Unified Audio-Video Generation via Stitching of Experts
Paper
•
2509.06155
•
Published
•
13
Revolutionizing Reinforcement Learning Framework for Diffusion Large
Language Models
Paper
•
2509.06949
•
Published
•
56
Reinforced Visual Perception with Tools
Paper
•
2509.01656
•
Published
•
31
Reinforcement Learning Foundations for Deep Research Systems: A Survey
Paper
•
2509.06733
•
Published
•
32
Visual Representation Alignment for Multimodal Large Language Models
Paper
•
2509.07979
•
Published
•
83
F1: A Vision-Language-Action Model Bridging Understanding and Generation
to Actions
Paper
•
2509.06951
•
Published
•
31
A Survey of Reinforcement Learning for Large Reasoning Models
Paper
•
2509.08827
•
Published
•
188
EnvX: Agentize Everything with Agentic AI
Paper
•
2509.08088
•
Published
•
8
HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI
Assistants
Paper
•
2509.08494
•
Published
•
1
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action
Model
Paper
•
2509.09372
•
Published
•
236
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal
Conditioning
Paper
•
2509.08519
•
Published
•
127
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Paper
•
2509.09674
•
Published
•
79
Kling-Avatar: Grounding Multimodal Instructions for Cascaded
Long-Duration Avatar Animation Synthesis
Paper
•
2509.09595
•
Published
•
48
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Paper
•
2509.09676
•
Published
•
31
Visual Programmability: A Guide for Code-as-Thought in Chart
Understanding
Paper
•
2509.09286
•
Published
•
11
Agentic Software Engineering: Foundational Pillars and a Research
Roadmap
Paper
•
2509.06216
•
Published
•
7
AI Agentic Programming: A Survey of Techniques, Challenges, and
Opportunities
Paper
•
2508.11126
•
Published
Agentic AI Frameworks: Architectures, Protocols, and Design Challenges
Paper
•
2508.10146
•
Published
Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question
Answering with LLMs
Paper
•
2509.15020
•
Published
•
4
Developer-LLM Conversations: An Empirical Study of Interactions and
Generated Code Quality
Paper
•
2509.10402
•
Published
•
5
Unleashing the Potential of Multimodal LLMs for Zero-Shot
Spatio-Temporal Video Grounding
Paper
•
2509.15178
•
Published
•
6
RecoWorld: Building Simulated Environments for Agentic Recommender
Systems
Paper
•
2509.10397
•
Published
•
7
MultiEdit: Advancing Instruction-based Image Editing on Diverse and
Challenging Tasks
Paper
•
2509.14638
•
Published
•
11
AToken: A Unified Tokenizer for Vision
Paper
•
2509.14476
•
Published
•
36
FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial
Search and Reasoning
Paper
•
2509.13160
•
Published
•
29
Understand Before You Generate: Self-Guided Training for Autoregressive
Image Generation
Paper
•
2509.15185
•
Published
•
29
Evolving Language Models without Labels: Majority Drives Selection,
Novelty Promotes Variation
Paper
•
2509.15194
•
Published
•
33
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform
Data
Paper
•
2509.15221
•
Published
•
109
FlowRL: Matching Reward Distributions for LLM Reasoning
Paper
•
2509.15207
•
Published
•
113
Reasoning over Boundaries: Enhancing Specification Alignment via
Test-time Deliberation
Paper
•
2509.14760
•
Published
•
52
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid
Vision Tokenizer
Paper
•
2509.16197
•
Published
•
54
Latent Zoning Network: A Unified Principle for Generative Modeling,
Representation Learning, and Classification
Paper
•
2509.15591
•
Published
•
45
Lynx: Towards High-Fidelity Personalized Video Generation
Paper
•
2509.15496
•
Published
•
12
LIMI: Less is More for Agency
Paper
•
2509.17567
•
Published
•
100
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion
Transformer Models
Paper
•
2509.17627
•
Published
•
66
Qwen3-Omni Technical Report
Paper
•
2509.17765
•
Published
•
135
OnePiece: Bringing Context Engineering and Reasoning to Industrial
Cascade Ranking System
Paper
•
2509.18091
•
Published
•
33
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning
for Video LLMs
Paper
•
2509.18056
•
Published
•
27
GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric
Reasoning
Paper
•
2509.17437
•
Published
•
17
EpiCache: Episodic KV Cache Management for Long Conversational Question
Answering
Paper
•
2509.17396
•
Published
•
19
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering
Tasks?
Paper
•
2509.16941
•
Published
•
21
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning
Models on Automatically Verifiable Textual and Visual Questions
Paper
•
2509.17177
•
Published
•
13
Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from
Token and Parameter Levels
Paper
•
2509.16596
•
Published
•
14
Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning
Paper
•
2509.18083
•
Published
•
5
Understanding Embedding Scaling in Collaborative Filtering
Paper
•
2509.15709
•
Published
•
5
ContextFlow: Training-Free Video Object Editing via Adaptive Context
Enrichment
Paper
•
2509.17818
•
Published
•
8
AuditoryBench++: Can Language Models Understand Auditory Knowledge
without Hearing?
Paper
•
2509.17641
•
Published
•
4
DIWALI - Diversity and Inclusivity aWare cuLture specific Items for
India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian
Context
Paper
•
2509.17399
•
Published
•
2
When Big Models Train Small Ones: Label-Free Model Parity Alignment for
Efficient Visual Question Answering using Small VLMs
Paper
•
2509.16633
•
Published
•
2
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and
Training Recipe
Paper
•
2509.18154
•
Published
•
50
Hyper-Bagel: A Unified Acceleration Framework for Multimodal
Understanding and Generation
Paper
•
2509.18824
•
Published
•
22
What Characterizes Effective Reasoning? Revisiting Length, Review, and
Structure of CoT
Paper
•
2509.19284
•
Published
•
22
VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via
Travel Video Itinerary Reconstruction
Paper
•
2509.19002
•
Published
•
2
Video models are zero-shot learners and reasoners
Paper
•
2509.20328
•
Published
•
96
SIM-CoT: Supervised Implicit Chain-of-Thought
Paper
•
2509.20317
•
Published
•
41
EmbeddingGemma: Powerful and Lightweight Text Representations
Paper
•
2509.20354
•
Published
•
39
EditVerse: Unifying Image and Video Editing and Generation with
In-Context Learning
Paper
•
2509.20360
•
Published
•
17
PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video
Generation
Paper
•
2509.20358
•
Published
•
14
Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal
Understanding and Generation
Paper
•
2509.19244
•
Published
•
11
Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just
What They Say
Paper
•
2509.21164
•
Published
•
8
VCRL: Variance-based Curriculum Reinforcement Learning for Large
Language Models
Paper
•
2509.19803
•
Published
•
118
SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
Paper
•
2509.21320
•
Published
•
99
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and
Open Resources
Paper
•
2509.21268
•
Published
•
101
Tree Search for LLM Agent Reinforcement Learning
Paper
•
2509.21240
•
Published
•
87
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Paper
•
2509.20427
•
Published
•
77
AutoIntent: AutoML for Text Classification
Paper
•
2509.21138
•
Published
•
35
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
Paper
•
2509.21117
•
Published
•
29
Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web
Reconnaissance, Tool Generation, and Task Execution
Paper
•
2509.21072
•
Published
•
15
Does FLUX Already Know How to Perform Physically Plausible Image
Composition?
Paper
•
2509.21278
•
Published
•
15
Thinking Augmented Pre-training
Paper
•
2509.20186
•
Published
•
23
Understanding the Thinking Process of Reasoning Models: A Perspective
from Schoenfeld's Episode Theory
Paper
•
2509.14662
•
Published
•
13
SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
Paper
•
2509.21318
•
Published
•
10
Interactive Recommendation Agent with Active User Commands
Paper
•
2509.21317
•
Published
•
6
UserRL: Training Interactive User-Centric Agent via Reinforcement
Learning
Paper
•
2509.19736
•
Published
•
11
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for
Video Temporal Reasoning
Paper
•
2509.21113
•
Published
•
5
SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and
Self-Reflective Agent
Paper
•
2509.20414
•
Published
•
9
Thinking While Listening: Simple Test Time Scaling For Audio
Classification
Paper
•
2509.19676
•
Published
•
4
When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks
Silently Undermine Validity
Paper
•
2509.20293
•
Published
•
7
Discrete Diffusion for Reflective Vision-Language-Action Models in
Autonomous Driving
Paper
•
2509.20109
•
Published
•
3
CompLLM: Compression for Long Context Q&A
Paper
•
2509.19228
•
Published
•
8
Blueprints of Trust: AI System Cards for End to End Transparency and
Governance
Paper
•
2509.20394
•
Published
•
2
StyleBench: Evaluating thinking styles in Large Language Models
Paper
•
2509.20868
•
Published
•
3
OverLayBench: A Benchmark for Layout-to-Image Generation with Dense
Overlaps
Paper
•
2509.19282
•
Published
•
7
LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale
Diffusion Transformer
Paper
•
2509.22414
•
Published
•
21
UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
Paper
•
2509.21760
•
Published
•
14
VoiceAssistant-Eval: Benchmarking AI Assistants across Listening,
Speaking, and Viewing
Paper
•
2509.22651
•
Published
•
22
Variational Reasoning for Language Models
Paper
•
2509.22637
•
Published
•
68
LongLive: Real-time Interactive Long Video Generation
Paper
•
2509.22622
•
Published
•
182
A Survey of Interactive Generative Video
Paper
•
2504.21853
•
Published
•
46
Evaluating Very Long-Term Conversational Memory of LLM Agents
Paper
•
2402.17753
•
Published
•
20
VBench: Comprehensive Benchmark Suite for Video Generative Models
Paper
•
2311.17982
•
Published
•
9
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic
Faithfulness
Paper
•
2503.21755
•
Published
•
33
VBench++: Comprehensive and Versatile Benchmark Suite for Video
Generative Models
Paper
•
2411.13503
•
Published
•
34
DreamBench++: A Human-Aligned Benchmark for Personalized Image
Generation
Paper
•
2406.16855
•
Published
•
57
VCBench: Benchmarking LLMs in Venture Capital
Paper
•
2509.14448
•
Published
AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection
Paper
•
2504.20865
•
Published
ConsumerBench: Benchmarking Generative AI Applications on End-User
Devices
Paper
•
2506.17538
•
Published
•
7
Benchmarking AI Models in Software Engineering: A Review, Search Tool,
and Enhancement Protocol
Paper
•
2503.05860
•
Published
•
11
MERA Code: A Unified Framework for Evaluating Code Generation Across
Tasks
Paper
•
2507.12284
•
Published
•
1
Benchmarking Neural Network Training Algorithms
Paper
•
2306.07179
•
Published
•
23
SpreadsheetBench: Towards Challenging Real World Spreadsheet
Manipulation
Paper
•
2406.14991
•
Published
•
2
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM
Evaluation
Paper
•
2506.00482
•
Published
•
8
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls
and Complex Instructions
Paper
•
2406.15877
•
Published
•
48
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Paper
•
2407.18961
•
Published
•
40
ImgEdit: A Unified Image Editing Dataset and Benchmark
Paper
•
2505.20275
•
Published
•
18
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image
Generation
Paper
•
2504.02782
•
Published
•
57
7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models
Paper
•
2508.12919
•
Published
•
1
Instruction-Following Evaluation in Function Calling for Large Language
Models
Paper
•
2509.18420
•
Published
•
1
MinerU2.5: A Decoupled Vision-Language Model for Efficient
High-Resolution Document Parsing
Paper
•
2509.22186
•
Published
•
132
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient
SpeechLLMs
Paper
•
2509.22220
•
Published
•
64
RealUnify: Do Unified Models Truly Benefit from Unification? A
Comprehensive Benchmark
Paper
•
2509.24897
•
Published
•
46
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation
and Editing
Paper
•
2509.24900
•
Published
•
53
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture,
Training and Dataset
Paper
•
2505.09568
•
Published
•
98
Unified Multimodal Understanding and Generation Models: Advances,
Challenges, and Opportunities
Paper
•
2505.02567
•
Published
•
80
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for
Large Language Models
Paper
•
2406.12644
•
Published
•
5
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing
via Compositional Dependencies
Paper
•
2506.12830
•
Published
CompBench: Benchmarking Complex Instruction-guided Image Editing
Paper
•
2505.12200
•
Published
Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought
Imagination
Paper
•
2509.01986
•
Published
•
4
GenEval: An Object-Focused Framework for Evaluating Text-to-Image
Alignment
Paper
•
2310.11513
•
Published
•
1
Visual Jigsaw Post-Training Improves MLLMs
Paper
•
2509.25190
•
Published
•
35
SANA-Video: Efficient Video Generation with Block Linear Diffusion
Transformer
Paper
•
2509.24695
•
Published
•
44
Democratizing AI scientists using ToolUniverse
Paper
•
2509.23426
•
Published
•
39
EasySteer: A Unified Framework for High-Performance and Extensible LLM
Steering
Paper
•
2509.25175
•
Published
•
29
Towards Personalized Deep Research: Benchmarks and Evaluations
Paper
•
2509.25106
•
Published
•
28
VideoScore2: Think before You Score in Generative Video Evaluation
Paper
•
2509.22799
•
Published
•
24
MMPB: It's Time for Multi-Modal Personalization
Paper
•
2509.22820
•
Published
•
14
Personalization of Large Language Models: A Survey
Paper
•
2411.00027
•
Published
•
33
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
Paper
•
2509.25161
•
Published
•
23
HunyuanImage 3.0 Technical Report
Paper
•
2509.23951
•
Published
•
21
PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on
Structured Images
Paper
•
2509.25185
•
Published
•
4
Local Success Does Not Compose: Benchmarking Large Language Models for
Compositional Formal Verification
Paper
•
2509.23061
•
Published
•
6
UniVid: The Open-Source Unified Video Model
Paper
•
2509.24200
•
Published
•
4
PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation
Paper
•
2509.23338
•
Published
•
4
BPMN Assistant: An LLM-Based Approach to Business Process Modeling
Paper
•
2509.24592
•
Published
•
1
Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large
Language Models
Paper
•
2509.23233
•
Published
•
2
Advancing Reference-free Evaluation of Video Captions with Factual
Analysis
Paper
•
2509.16538
•
Published
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP
Use
Paper
•
2509.24002
•
Published
•
171
OceanGym: A Benchmark Environment for Underwater Embodied Agents
Paper
•
2509.26536
•
Published
•
34
DC-VideoGen: Efficient Video Generation with Deep Compression Video
Autoencoder
Paper
•
2509.25182
•
Published
•
36
Learning to See Before Seeing: Demystifying LLM Visual Priors from
Language Pre-training
Paper
•
2509.26625
•
Published
•
43
VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in
Real-world Applications
Paper
•
2509.26490
•
Published
•
19
dParallel: Learnable Parallel Decoding for dLLMs
Paper
•
2509.26488
•
Published
•
19
DA^2: Depth Anything in Any Direction
Paper
•
2509.26618
•
Published
•
25
TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics
Paper
•
2509.26329
•
Published
•
2
Video Object Segmentation-Aware Audio Generation
Paper
•
2509.26604
•
Published
•
1
BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source
Software
Paper
•
2509.25248
•
Published
•
2
Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional
Video Generation
Paper
•
2509.26555
•
Published
Regression Language Models for Code
Paper
•
2509.26476
•
Published
•
16
The Pitfalls of KV Cache Compression
Paper
•
2510.00231
•
Published
•
5
Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents
Paper
•
2509.26539
•
Published
•
8
LayerD: Decomposing Raster Graphic Designs into Layers
Paper
•
2509.25134
•
Published
•
1
Improving Editability in Image Generation with Layer-wise Memory
Paper
•
2505.01079
•
Published
•
29
Generative Image Layer Decomposition with Visual Effects
Paper
•
2411.17864
•
Published
Edit Transfer: Learning Image Editing via Vision In-Context Relations
Paper
•
2503.13327
•
Published
•
29
Text2Layer: Layered Image Generation using Latent Diffusion Model
Paper
•
2307.09781
•
Published
•
15
Code2Video: A Code-centric Paradigm for Educational Video Generation
Paper
•
2510.01174
•
Published
•
33
GEM: A Gym for Agentic LLMs
Paper
•
2510.01051
•
Published
•
88
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model
Responses
Paper
•
2510.00232
•
Published
•
15
In-Place Feedback: A New Paradigm for Guiding LLMs in Multi-Turn
Reasoning
Paper
•
2510.00777
•
Published
•
2
An Empirical Study of Testing Practices in Open Source AI Agent
Frameworks and Agentic Applications
Paper
•
2509.19185
•
Published
•
3
Can Large Multimodal Models Uncover Deep Semantics Behind Images?
Paper
•
2402.11281
•
Published
•
1
Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models
Paper
•
2509.25162
•
Published
•
3
BindWeave: Subject-Consistent Video Generation via Cross-Modal
Integration
Paper
•
2510.00438
•
Published
•
7
BatonVoice: An Operationalist Framework for Enhancing Controllable
Speech Synthesis with Linguistic Intelligence from LLMs
Paper
•
2509.26514
•
Published
•
3
Eliciting Secret Knowledge from Language Models
Paper
•
2510.01070
•
Published
•
4
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Paper
•
2510.02283
•
Published
•
92
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world
Markets?
Paper
•
2510.02209
•
Published
•
52
BloombergGPT: A Large Language Model for Finance
Paper
•
2303.17564
•
Published
•
27
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Paper
•
2510.01284
•
Published
•
32
A Rigorous Benchmark with Multidimensional Evaluation for Deep Research
Agents: From Answers to Reports
Paper
•
2510.02190
•
Published
•
18
VIRTUE: Visual-Interactive Text-Image Universal Embedder
Paper
•
2510.00523
•
Published
•
6
Breaking the Modality Barrier: Universal Embedding Learning with
Multimodal LLMs
Paper
•
2504.17432
•
Published
•
39
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper
•
2411.04997
•
Published
•
39
Veagle: Advancements in Multimodal Representation Learning
Paper
•
2403.08773
•
Published
•
10
CoDA: Agentic Systems for Collaborative Data Visualization
Paper
•
2510.03194
•
Published
•
28
SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?
Paper
•
2510.03120
•
Published
•
6
Paper2Video: Automatic Video Generation from Scientific Papers
Paper
•
2510.05096
•
Published
•
111
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Paper
•
2510.05094
•
Published
•
36
Agentic Context Engineering: Evolving Contexts for Self-Improving
Language Models
Paper
•
2510.04618
•
Published
•
120
Hybrid Architectures for Language Models: Systematic Analysis and Design
Insights
Paper
•
2510.04800
•
Published
•
36
Cache-to-Cache: Direct Semantic Communication Between Large Language
Models
Paper
•
2510.03215
•
Published
•
96
Ming-UniVision: Joint Image Understanding and Generation with a Unified
Continuous Tokenizer
Paper
•
2510.06590
•
Published
•
70
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal
Generation and Understanding
Paper
•
2510.06308
•
Published
•
53
SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
Paper
•
2510.06917
•
Published
•
34
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with
Holistic Platform and Adaptive Hybrid Policy Optimization
Paper
•
2510.08540
•
Published
•
108
MATRIX: Mask Track Alignment for Interaction-aware Video Generation
Paper
•
2510.07310
•
Published
•
35
RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
Paper
•
2510.06710
•
Published
•
38
Vibe Checker: Aligning Code Evaluation with Human Preference
Paper
•
2510.07315
•
Published
•
32
Online Generic Event Boundary Detection
Paper
•
2510.06855
•
Published
•
3
Bridging Text and Video Generation: A Survey
Paper
•
2510.04999
•
Published
•
3
U-Bench: A Comprehensive Understanding of U-Net through 100-Variant
Benchmarking
Paper
•
2510.07041
•
Published
•
3
DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for
Autonomous Travel Planning Agents
Paper
•
2509.21842
•
Published
•
2
Agent Learning via Early Experience
Paper
•
2510.08558
•
Published
•
262
UniVideo: Unified Understanding, Generation, and Editing for Videos
Paper
•
2510.08377
•
Published
•
70
UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video
Super-Resolution
Paper
•
2510.08143
•
Published
•
20
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
Paper
•
2510.03663
•
Published
•
15
NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM
Agents
Paper
•
2510.07172
•
Published
•
28
VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal
Patches via In-Context Conditioning
Paper
•
2510.08555
•
Published
•
62
Recycling Pretrained Checkpoints: Orthogonal Growth of
Mixture-of-Experts for Efficient Large Language Model Pre-Training
Paper
•
2510.08008
•
Published
•
5
Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs
Paper
•
2510.07429
•
Published
•
3
Beyond Turn Limits: Training Deep Search Agents with Dynamic Context
Window
Paper
•
2510.08276
•
Published
•
9
SciVideoBench: Benchmarking Scientific Video Reasoning in Large
Multimodal Models
Paper
•
2510.08559
•
Published
•
8
Character Mixing for Video Generation
Paper
•
2510.05093
•
Published
•
6
WithAnyone: Towards Controllable and ID Consistent Image Generation
Paper
•
2510.14975
•
Published
•
80
From Pixels to Words -- Towards Native Vision-Language Primitives at
Scale
Paper
•
2510.14979
•
Published
•
65
Attention Is All You Need for KV Cache in Diffusion LLMs
Paper
•
2510.14973
•
Published
•
38
LLM-guided Hierarchical Retrieval
Paper
•
2510.13217
•
Published
•
16
Qwen3Guard Technical Report
Paper
•
2510.14276
•
Published
•
13
Learning an Image Editing Model without Image Editing Pairs
Paper
•
2510.14978
•
Published
•
7
pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
Paper
•
2510.14974
•
Published
•
9
RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval
Augmented Generation Systems
Paper
•
2510.13910
•
Published
•
1
DeepAgent: A General Reasoning Agent with Scalable Toolsets
Paper
•
2510.21618
•
Published
•
95
Video-As-Prompt: Unified Semantic Control for Video Generation
Paper
•
2510.20888
•
Published
•
44
UI-Ins: Enhancing GUI Grounding with Multi-Perspective
Instruction-as-Reasoning
Paper
•
2510.20286
•
Published
•
23
From Denoising to Refining: A Corrective Framework for Vision-Language
Diffusion Model
Paper
•
2510.19871
•
Published
•
29
RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via
Hierarchical Model Merging
Paper
•
2510.20479
•
Published
•
10
Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
Paper
•
2510.13251
•
Published
•
12
Model Merging with Functional Dual Anchors
Paper
•
2510.21223
•
Published
•
12
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via
Data Alignment and Test-Time Scaling
Paper
•
2510.20206
•
Published
•
11
Paper
•
2510.18212
•
Published
•
33
Visual Diffusion Models are Geometric Solvers
Paper
•
2510.21697
•
Published
•
18
AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research
Suite
Paper
•
2510.21652
•
Published
•
3
ARC-Encoder: learning compressed text representations for large language
models
Paper
•
2510.20535
•
Published
•
5
Taming Modality Entanglement in Continual Audio-Visual Segmentation
Paper
•
2510.17234
•
Published
•
3
MemOS: A Memory OS for AI System
Paper
•
2507.03724
•
Published
•
155
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world
APIs
Paper
•
2307.16789
•
Published
•
101
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
Paper
•
2304.08244
•
Published
•
1
ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models
in Multi-Hop Tool Use
Paper
•
2501.02506
•
Published
•
11
WebShop: Towards Scalable Real-World Web Interaction with Grounded
Language Agents
Paper
•
2207.01206
•
Published
•
3
GAIA: a benchmark for General AI Assistants
Paper
•
2311.12983
•
Published
•
241
Task Vectors are Cross-Modal
Paper
•
2410.22330
•
Published
•
11
In-Context Learning Creates Task Vectors
Paper
•
2310.15916
•
Published
•
43
Group Relative Attention Guidance for Image Editing
Paper
•
2510.24657
•
Published
•
23
OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
Paper
•
2510.24563
•
Published
•
22
WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling
Info-Rich Seeking
Paper
•
2510.24697
•
Published
•
20
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing
Actions
Paper
•
2510.10666
•
Published
•
27
WideSearch: Benchmarking Agentic Broad Info-Seeking
Paper
•
2508.07999
•
Published
•
109
SealQA: Raising the Bar for Reasoning in Search-Augmented Language
Models
Paper
•
2506.01062
•
Published
•
5
Routing Matters in MoE: Scaling Diffusion Transformers with Explicit
Routing Guidance
Paper
•
2510.24711
•
Published
•
18
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
Paper
•
2510.22373
•
Published
•
14
PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text
Embedding
Paper
•
2510.22264
•
Published
•
1
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with
the MME-CoF Benchmark
Paper
•
2510.26802
•
Published
•
32
AMO-Bench: Large Language Models Still Struggle in High School Math
Competitions
Paper
•
2510.26768
•
Published
•
33
The Era of Agentic Organization: Learning to Organize with Language
Models
Paper
•
2510.26658
•
Published
•
25
OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal
Document Layout Generation
Paper
•
2510.26213
•
Published
•
9
Magentic Marketplace: An Open-Source Environment for Studying Agentic
Markets
Paper
•
2510.25779
•
Published
•
9
CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
Paper
•
2510.26160
•
Published
•
15
ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Paper
•
2510.26781
•
Published
Emu3.5: Native Multimodal Models are World Learners
Paper
•
2510.26583
•
Published
•
103
The End of Manual Decoding: Towards Truly End-to-End Language Models
Paper
•
2510.26697
•
Published
•
113
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement
Learning
Paper
•
2510.23473
•
Published
•
83
JanusCoder: Towards a Foundational Visual-Programmatic Interface for
Code Intelligence
Paper
•
2510.23538
•
Published
•
95
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic,
and Long-Horizon Task Execution
Paper
•
2510.25726
•
Published
•
44
VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context
Learning
Paper
•
2510.25772
•
Published
•
32
The Principles of Diffusion Models
Paper
•
2510.21890
•
Published
•
56
RegionE: Adaptive Region-Aware Generation for Efficient Image Editing
Paper
•
2510.25590
•
Published
•
25
Multimodal Spatial Reasoning in the Large Model Era: A Survey and
Benchmarks
Paper
•
2510.25760
•
Published
•
16
SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In
Text-only LLMs
Paper
•
2510.25092
•
Published
•
7
Reasoning Language Model Inference Serving Unveiled: An Empirical Study
Paper
•
2510.18672
•
Published
•
7
InteractComp: Evaluating Search Agents With Ambiguous Queries
Paper
•
2510.24668
•
Published
•
96
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization
Formats
Paper
•
2510.25602
•
Published
•
69
ThinkMorph: Emergent Properties in Multimodal Interleaved
Chain-of-Thought Reasoning
Paper
•
2510.27492
•
Published
•
79
Defeating the Training-Inference Mismatch via FP16
Paper
•
2510.26788
•
Published
•
27
Revisiting Multimodal Positional Encoding in Vision-Language Models
Paper
•
2510.23095
•
Published
•
20
Higher-order Linear Attention
Paper
•
2510.27258
•
Published
•
11
The Denario project: Deep knowledge AI agents for scientific discovery
Paper
•
2510.26887
•
Published
•
6
UniLumos: Fast and Unified Image and Video Relighting with
Physics-Plausible Feedback
Paper
•
2511.01678
•
Published
•
33
Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open
Language Foundation
Paper
•
2510.22115
•
Published
•
81
The Underappreciated Power of Vision Models for Graph Structural
Understanding
Paper
•
2510.24788
•
Published
•
35
UniREditBench: A Unified Reasoning-based Image Editing Benchmark
Paper
•
2511.01295
•
Published
•
37
ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool
Use
Paper
•
2510.27363
•
Published
•
22
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal
Generation
Paper
•
2511.01163
•
Published
•
31
Towards Universal Video Retrieval: Generalizing Video Embedding via
Synthesized Multimodal Pyramid Curriculum
Paper
•
2510.27571
•
Published
•
17
TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images
Reasoning
Paper
•
2511.01833
•
Published
•
15
LongCat-Flash-Omni Technical Report
Paper
•
2511.00279
•
Published
•
21
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement
Reading with MeasureBench
Paper
•
2510.26865
•
Published
•
11
Actial: Activate Spatial Reasoning Ability of Multimodal Large Language
Models
Paper
•
2511.01618
•
Published
•
9
Trove: A Flexible Toolkit for Dense Retrieval
Paper
•
2511.01857
•
Published
•
10
Towards Robust Mathematical Reasoning
Paper
•
2511.01846
•
Published
•
7
MotionStream: Real-Time Video Generation with Interactive Motion
Controls
Paper
•
2511.01266
•
Published
•
26
UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings
Paper
•
2511.00405
•
Published
•
5
Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
Paper
•
2511.01617
•
Published
•
2
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual
Representation
Paper
•
2511.02778
•
Published
•
100
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for
Visual Chain-of-Thought
Paper
•
2511.02779
•
Published
•
53
LTD-Bench: Evaluating Large Language Models by Letting Them Draw
Paper
•
2511.02347
•
Published
•
8
TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
Paper
•
2511.02832
•
Published
•
8
Can Visual Input Be Compressed? A Visual Token Compression Benchmark for
Large Multimodal Models
Paper
•
2511.02650
•
Published
•
9
CodeClash: Benchmarking Goal-Oriented Software Engineering
Paper
•
2511.00839
•
Published
•
8
iFlyBot-VLA Technical Report
Paper
•
2511.01914
•
Published
•
5
TabDSR: Decompose, Sanitize, and Reason for Complex Numerical Reasoning
in Tabular Data
Paper
•
2511.02219
•
Published
•
1
LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for
LLMs in Chinese Context
Paper
•
2511.02366
•
Published
•
2
VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation
Models
Paper
•
2511.02712
•
Published
•
2
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive
Capacity
Paper
•
2511.03146
•
Published
•
7
TabTune: A Unified Library for Inference and Fine-Tuning Tabular
Foundation Models
Paper
•
2511.02802
•
Published
•
13
V-Thinker: Interactive Thinking with Images
Paper
•
2511.04460
•
Published
•
94
Thinking with Video: Video Generation as a Promising Multimodal
Reasoning Paradigm
Paper
•
2511.04570
•
Published
•
189
Scaling Agent Learning via Experience Synthesis
Paper
•
2511.03773
•
Published
•
75
NVIDIA Nemotron Nano V2 VL
Paper
•
2511.03929
•
Published
•
26
GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
Paper
•
2511.04307
•
Published
•
14
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable
Non-Visual Shortcuts
Paper
•
2511.04655
•
Published
•
7
Diffusion Language Models are Super Data Learners
Paper
•
2511.03276
•
Published
•
114
A Survey of LLM-Driven AI Agent Communication: Protocols, Security
Risks, and Defense Countermeasures
Paper
•
2506.19676
•
Published
MCP-AgentBench: Evaluating Real-World Language Agent Performance with
MCP-Mediated Tools
Paper
•
2509.09734
•
Published
•
15
DeepEyesV2: Toward Agentic Multimodal Model
Paper
•
2511.05271
•
Published
•
38
Paper
•
2511.05491
•
Published
•
46
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
Paper
•
2511.04962
•
Published
•
50
Towards Mitigating Hallucinations in Large Vision-Language Models by
Refining Textual Embeddings
Paper
•
2511.05017
•
Published
•
7
Paper
•
2511.05369
•
Published
•
9
Real-Time Reasoning Agents in Evolving Environments
Paper
•
2511.04898
•
Published
•
11