The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization
Abstract
AGL1K, an audio geo-localization benchmark built from curated audio clips, is introduced to evaluate multiple audio language models and advance their geospatial reasoning capabilities.
Geo-localization aims to infer the geographic origin of a given signal. In computer vision, geo-localization has served as a demanding benchmark for compositional reasoning and is relevant to public safety. In contrast, progress on audio geo-localization has been constrained by the lack of high-quality audio-location pairs. To address this gap, we introduce AGL1K, the first audio geo-localization benchmark for audio language models (ALMs), spanning 72 countries and territories. To extract reliably localizable samples from a crowd-sourced platform, we propose the Audio Localizability metric, which quantifies the informativeness of each recording, yielding 1,444 curated audio clips. Evaluations of 16 ALMs show that audio geo-localization capability has begun to emerge in these models. We find that closed-source models substantially outperform open-source models, and that linguistic cues often dominate as a scaffold for prediction. We further analyze ALMs' reasoning traces, regional bias, error causes, and the interpretability of the localizability metric. Overall, AGL1K establishes a benchmark for audio geo-localization and may help advance ALMs toward stronger geospatial reasoning.
Community
We found the sonar moment in audio language models. We propose the task of audio geo-localization, and remarkably, Gemini 3 Pro achieves a distance error of less than 55 km on 25% of the samples.
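As a point of reference for how such numbers are obtained: geo-localization benchmarks typically score each prediction by the great-circle distance between predicted and ground-truth coordinates, then report error quantiles. Below is a minimal sketch of that evaluation, assuming distance-based scoring; the sample coordinates and city pairings are made-up illustrations, not the paper's actual evaluation code or data.

```python
# Minimal sketch (not the paper's official evaluation code): score predicted
# coordinates by great-circle (haversine) distance, the standard metric in
# geo-localization, then check how many samples fall within a threshold.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Each entry is ((ground-truth lat, lon), (predicted lat, lon)) -- hypothetical.
samples = [
    ((48.8566, 2.3522), (48.85, 2.35)),      # Paris, near-exact hit
    ((35.6762, 139.6503), (34.69, 135.50)),  # Tokyo predicted as Osaka
    ((52.5200, 13.4050), (52.23, 21.01)),    # Berlin predicted as Warsaw
    ((40.7128, -74.0060), (51.51, -0.13)),   # New York predicted as London
]

errors = sorted(haversine_km(*gt, *pred) for gt, pred in samples)
within_55km = sum(e < 55 for e in errors) / len(errors)
print(f"errors (km): {[round(e) for e in errors]}")
print(f"fraction within 55 km: {within_55km:.0%}")  # 25% for this toy set
```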
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach (2026)
- GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models (2025)
- Enhancing Geo-localization for Crowdsourced Flood Imagery via LLM-Guided Attention (2025)
- GEO-Detective: Unveiling Location Privacy Risks in Images with LLM Agents (2025)
- GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization (2025)
- MapTrace: Scalable Data Generation for Route Tracing on Maps (2025)
- ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning (2025)
Here are a few concerns I had after reading the paper.
First, the motivation feels a bit thin: it’s not obvious how often we truly need to infer location from audio alone in real-world settings, especially when many plausible use cases would typically rely on richer signals (video/images, timestamps, device metadata, or surrounding context).
Second, the dataset construction may introduce strong sampling bias. Since the benchmark is built from user-uploaded clips on Aporee, it likely over-represents travel/landmark-style recordings and soundscapes with strong linguistic cues, rather than a distribution that resembles everyday environments.
Third, the scale is quite small for a “global” claim (about 1.4K clips across 72 countries/regions, with clear geographic imbalance). With this size, it’s hard to conclude models have a generally reliable audio geo-localization capability; the results could mostly reflect success on a limited set of highly localizable or otherwise “representative” locations.
Finally, since the audio originates from a public online source, it’s plausible that parts of this corpus (or close variants) were already present in some models’ pretraining data. If so, strong performance might reflect memorization or retrieval of seen content rather than genuine audio-based reasoning.
Thank you for the thoughtful and substantive feedback. Below, we respond to each concern in turn and clarify both the motivation and the scope of our work.
(1) On the motivation and real-world relevance of audio-only geo-localization.
We agree that many real-world systems can leverage richer multimodal signals (e.g., video, metadata, timestamps). However, our focus on audio-only geo-localization is motivated by settings where such auxiliary information is unavailable, unreliable, or intentionally absent. A concrete and practically important example arises in public safety and emergency response. In many regions, local authorities routinely receive a large number of anonymous or spoofed phone calls (e.g., bomb threats or false alarms), where only raw audio is available. In some large metropolitan areas, such calls can occur multiple times per day. The ability to infer coarse-grained geographic plausibility from audio alone can help prioritize responses and filter clearly inconsistent or malicious reports, yielding tangible safety value. More broadly, spoken communication remains a dominant medium in everyday life (phone calls, voice messages, emergency hotlines), making audio-only inference a relevant and underexplored capability. Our goal is therefore not to replace multimodal systems, but to characterize and benchmark what is and is not possible when audio is the sole signal.
(2) On potential sampling bias in the dataset construction.
The reviewer is correct that Aporee, as a user-uploaded platform, does not represent a random sample of all acoustic environments. We took several steps to mitigate this bias. First, during curation, we explicitly enforced a balance between clips with and without spoken language (50/50), reducing over-reliance on linguistic cues. Second, in the human filtering stage, annotators were instructed to preferentially retain clips reflecting everyday environments (e.g., streets, transportation, residential surroundings) rather than only iconic landmarks or touristic recordings. Importantly, while we initially collected over 30,000 candidate clips, the final ~1.4K samples represent a deliberately high-quality subset that meets strict criteria for audio clarity, geographic validity, and localizability. In this sense, the benchmark is not intended to mirror the raw Aporee distribution, but to isolate a controlled set of audio-location pairs suitable for studying model behavior.
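Purely as an illustration of the 50/50 speech balancing described above, here is a minimal sketch; the `has_speech` field, the function name, and the random-sampling strategy are hypothetical stand-ins, not the actual curation pipeline.

```python
# Hypothetical sketch of the 50/50 speech balancing step; the clip schema and
# sampling strategy are illustrative, not the paper's actual pipeline.
import random

def balance_speech(clips, n_total, seed=0):
    """Sample equal numbers of clips with and without spoken language."""
    rng = random.Random(seed)
    speech = [c for c in clips if c["has_speech"]]
    non_speech = [c for c in clips if not c["has_speech"]]
    k = min(n_total // 2, len(speech), len(non_speech))
    return rng.sample(speech, k) + rng.sample(non_speech, k)

# e.g., curated = balance_speech(candidate_clips, n_total=1444)
```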
(3) On the dataset scale and the “global” claim.
We acknowledge that the absolute scale of the benchmark is modest relative to large vision or language datasets, and that geographic imbalance remains. Our use of the term “global” is intended to indicate geographic coverage (72 countries/regions across multiple continents), rather than statistical representativeness of all environments worldwide. We agree that the current scale does not support claims of fully reliable, universal audio geo-localization. Instead, our results should be interpreted as evidence that modern audio-language models exhibit non-trivial, uneven, and fragile geo-inference capabilities under constrained conditions. We view this dataset as a first diagnostic benchmark rather than a definitive measure, and we explicitly position scaling, rebalancing, and broader environment coverage as key future work.
(4) On potential contamination from model pretraining data.
This is an important concern. We believe the risk of direct memorization is limited for several reasons. Aporee enforces strict usage policies that restrict automated scraping by AI systems, and, critically, even if audio files were accessed, obtaining reliable audio–location metadata pairings at scale is non-trivial. This lack of existing benchmarks is precisely why we constructed the dataset. Nevertheless, we do not claim to fully rule out partial exposure. Rather than assuming a clean separation, we frame our benchmark as measuring effective capability under realistic training conditions, which may include incidental exposure to similar content. Distinguishing memorization from genuine reasoning is an important open problem, and our error analyses (e.g., failures on acoustically similar but geographically distinct locations) suggest that models are far from exhibiting simple lookup-based behavior.