Multi-modal Large Language Models (MLLMs) can read entire documents, but can they find precise details buried in complex layouts? NiM-Benchmark introduces tough new VQA challenges that demand fine-grained, localized visual reasoning, such as spotting a menu price or a single table value in a research paper. Spot-IT delivers an intuitive approach that helps MLLMs "zoom in": it intelligently selects image patches and applies a dynamic Gaussian focus guided by the user's query. Large-scale evaluation shows consistent accuracy gains over the underlying models and reveals how far they still lag behind human readers, especially on tiny, context-heavy details.
Vision-Language Models are bridging the gap between seeing and understanding, but their real-world utility hinges on precision: can they find exactly the answer, no more and no less, when it occupies only a sliver of a dense document? Most DocVQA benchmarks miss this, focusing on broad summaries or generic extraction rather than the real pain point of hidden, context-rich queries.
NiM-Benchmark is crafted for precision stress-testing, with 2,970 images and 1,180 question-answer pairs drawn from everyday sources:
Each question targets a region covering less than 5% of the visual area (a sketch of this area check follows the table below). Domain variety and question complexity are ensured (see paper Table 9).
| Domain | Sample Types |
|---|---|
| Restaurant Menus | Prices, dish ingredients, dietary info |
| Academic Papers | Figure/table values, technical parameters |
| Magazines / Newspapers | Dates, events, entities, scores |
| Websites / Lectures | UI details, content, nav elements |
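The under-5% criterion implies a simple filter over annotated answer regions. Below is a minimal sketch of that check, assuming each question is paired with a pixel-space bounding box for its answer; the bounding-box format and function names are illustrative assumptions, not the benchmark's actual tooling.

```python
def region_area_fraction(bbox, image_size):
    """Fraction of the image covered by an answer region.
    bbox = (x0, y0, x1, y1) in pixels; image_size = (width, height)."""
    x0, y0, x1, y1 = bbox
    width, height = image_size
    region = max(0, x1 - x0) * max(0, y1 - y0)
    return region / float(width * height)

def is_needle_question(bbox, image_size, max_fraction=0.05):
    """Keep only questions whose answer region covers < 5% of the image."""
    return region_area_fraction(bbox, image_size) < max_fraction
```

For instance, a 120 x 40 px price tag on a 1,000 x 1,400 px menu scan covers roughly 0.34% of the image and would easily qualify as a "needle" question.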
Spot-IT successfully identifies and focuses on specific menu items and prices within complex restaurant menu layouts. The system can precisely locate fine details like individual dish prices, ingredients, and dietary information that occupy small regions of the overall document.
Academic content presents unique challenges with dense information layouts. Spot-IT demonstrates its ability to extract fine-grained details from lecture slides, focusing on specific data points, formulas, or textual elements that would be difficult to identify without targeted attention mechanisms.
Web interfaces contain numerous small interactive elements and textual details. This example showcases how Spot-IT can pinpoint specific UI components, navigation elements, or content sections within complex website layouts, enabling precise information extraction from digital interfaces.
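The "zoom in" idea behind these examples is to score image patches against the question and then render a query-guided Gaussian emphasis around the best match. The sketch below illustrates that idea only; the fixed patch grid, the blend-to-gray rendering, and the function names are assumptions rather than Spot-IT's actual implementation.

```python
import numpy as np

def pick_focus_center(patch_scores, patch_grid, image_hw):
    """Map the highest-scoring patch (query-to-patch relevance scores,
    e.g. from a CLIP-style text-image similarity) to a pixel-space center."""
    rows, cols = patch_grid
    h, w = image_hw
    idx = int(np.argmax(patch_scores))
    r, c = divmod(idx, cols)
    return ((c + 0.5) * w / cols, (r + 0.5) * h / rows)  # (x, y) center

def gaussian_focus(image, center_xy, sigma):
    """Emphasize the region around center_xy in a float image (H x W x C,
    values in [0, 1]) with a 2D Gaussian mask; the rest fades toward gray."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center_xy
    mask = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    background = np.full_like(image, 0.5)
    return mask[..., None] * image + (1.0 - mask[..., None]) * background
```

In use, any query-conditioned scorer could supply `patch_scores`; the focused rendering is then passed to the MLLM alongside (or instead of) the original page image.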
Spot-IT was tested on open-source and leading commercial MLLMs. Highlights:
| Model | NiM EM (%) | NiM F1 (%) | Human EM (%) |
|---|---|---|---|
| GPT-4o | 38 | 48 | 63 |
| GPT-4o + Spot-IT | 46 | 56 | |
| Gemini-1.5-Flash | 22 | 28 | |
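EM and F1 above are the usual short-answer metrics: exact match after normalization, and token-level overlap F1. A standard SQuAD-style implementation is sketched below; the paper's exact normalization rules may differ, so treat the details as assumptions.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```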
```bibtex
@inproceedings{thakkar-etal-2025-finding,
  title     = "Finding Needles in Images: Can Multi-modal {LLM}s Locate Fine Details?",
  author    = "Thakkar, Parth and Agarwal, Ankush and Kasu, Prasad and Bansal, Pulkit and Devaguptapu, Chaitanya",
  editor    = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
  booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month     = jul,
  year      = "2025",
  address   = "Vienna, Austria",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2025.acl-long.1152/",
  pages     = "23626--23648",
  ISBN      = "979-8-89176-251-0"
}
```