Finding Needles in Images: Can Multi-modal LLMs Locate Fine Details?

A showcase of the NiM-Benchmark and Spot-IT for fine-grained document AI

Parth Thakkar1,*, Ankush Agarwal1,*, Prasad Kasu1, Pulkit Bansal1,2,†, Chaitanya Devaguptapu1
1Fujitsu Research India    2Indian Institute of Technology Patna
*Equal contribution    †Work partially done at Fujitsu Research India

Abstract

Multi-modal Large Language Models (MLLMs) can read entire documents, but can they find hidden, precise details buried in complex layouts? NiM-Benchmark introduces tough new VQA challenges that demand fine-grained, localized visual reasoning, like spotting a menu price or a single value in a research-paper table. Spot-IT delivers an intuitive approach that helps MLLMs "zoom in": it intelligently picks image patches and applies a dynamic Gaussian focus guided by the user's query. Large-scale evaluation shows consistent accuracy gains from Spot-IT across models and reveals just how far MLLMs still lag behind human readers, especially for tiny, context-heavy details.

Introduction

Vision-Language Models are bridging the gap between seeing and understanding, but their real-world utility hinges on precision: can they find just the answer, no more and no less, when it occupies only a sliver of a dense document? Most DocVQA benchmarks miss this, focusing on broad summaries or generic extraction rather than the real-world pain point of hidden, context-rich queries.

Why does this matter?

Visual document QA fails if models can't locate tiny, vital details, the same details users hunt for when flipping through legal contracts, academic papers, or digital receipts.

NiM-Benchmark: Fine-Grained Challenges

NiM-Benchmark is crafted for precision stress-testing, with 2,970 images and 1,180 question-answer pairs drawn from everyday sources:

Each question targets a region covering less than 5% of the visual area. Domain variety and question complexity are ensured (see Table 9 of the paper).

Domain | Sample Types
Restaurant Menus | Prices, dish ingredients, dietary info
Academic Papers | Figure/table values, technical parameters
Magazines / Newspapers | Dates, events, entities, scores
Websites / Lectures | UI details, content, navigation elements

Spot-IT: Human-Like Visual Focus for MLLMs

1. Query-Guided Patch Identification

  • Divide the image into an n × n grid of patches (n = 6).
  • A vision-language encoder (SigLIP) scores each patch against the query using cosine similarity.
  • The patch with the highest similarity becomes the focus region (see the sketch below).
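
To make this step concrete, here is a minimal sketch of query-guided patch scoring built on a Hugging Face SigLIP checkpoint. The checkpoint name, the grid helper, and the scoring code are illustrative assumptions; the paper only specifies a SigLIP-style matcher over a 6 × 6 patch grid.

from PIL import Image
import torch
from transformers import AutoProcessor, AutoModel

# Illustrative checkpoint; Spot-IT's exact SigLIP variant may differ.
MODEL_ID = "google/siglip-base-patch16-224"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def split_into_patches(image, n=6):
    """Divide the page image into an n x n grid of equal-sized patches."""
    w, h = image.size
    pw, ph = w // n, h // n
    patches, boxes = [], []
    for row in range(n):
        for col in range(n):
            box = (col * pw, row * ph, (col + 1) * pw, (row + 1) * ph)
            patches.append(image.crop(box))
            boxes.append(box)
    return patches, boxes

@torch.no_grad()
def best_patch(image, query, n=6):
    """Return the patch box most similar to the query, plus its cosine score."""
    patches, boxes = split_into_patches(image, n)
    img_inputs = processor(images=patches, return_tensors="pt")
    txt_inputs = processor(text=[query], padding="max_length", return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
    txt_emb = model.get_text_features(**txt_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(-1)  # cosine similarity per patch
    idx = int(sims.argmax())
    return boxes[idx], float(sims[idx])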

2. Dynamic Gaussian Attention

  • Apply a Gaussian mask centered on the selected patch, with its width set by the match confidence.
  • Strong matches get a sharper focus; weak matches get a softer, wider one.
  • Blend the highlighted region into the original image and pass the result to the MLLM (see the sketch below).
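
The following is a minimal sketch of the dynamic Gaussian focus under stated assumptions: the mapping from confidence to mask width (base_sigma, min_sigma) and the dimming blend (dim_factor) are illustrative constants, not values from the paper, which describes the idea of sharper focus for stronger matches while keeping the full page visible.

import numpy as np
from PIL import Image

def gaussian_focus(image, box, confidence,
                   base_sigma=0.35, min_sigma=0.10, dim_factor=0.45):
    """Blend a Gaussian spotlight centered on `box` into the page image.

    `confidence` is assumed to be normalized to [0, 1]; higher confidence
    gives a smaller sigma (sharper spotlight). The rest of the page is dimmed
    rather than removed, so global context stays visible to the MLLM.
    """
    image = image.convert("RGB")
    w, h = image.size
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0

    # Width of the spotlight as a fraction of the longer image side.
    sigma = max(min_sigma, base_sigma * (1.0 - confidence)) * max(w, h)

    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

    img = np.asarray(image).astype(np.float32)
    # Full brightness inside the focus, progressively dimmer outside it.
    weight = dim_factor + (1.0 - dim_factor) * mask[..., None]
    return Image.fromarray(np.clip(img * weight, 0, 255).astype(np.uint8))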

3. Answer Generation

  • Use a standard DocVQA prompt, now paired with the focus-enhanced image (full pipeline sketched below).
  • No model retraining is required.
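
Putting the pieces together, an end-to-end sketch reusing the helpers above could look like the following. call_mllm is a placeholder for whichever MLLM interface is used (GPT-4o, Gemini, or an open model), and the prompt wording is illustrative rather than the exact prompt from the paper.

def answer_question(image, query, call_mllm):
    """Spot-IT-style pipeline sketch: locate, focus, then ask the MLLM."""
    box, sim = best_patch(image, query)               # step 1: query-guided patch
    confidence = max(0.0, min(1.0, sim))              # crude clamp to [0, 1]
    focused = gaussian_focus(image, box, confidence)  # step 2: dynamic Gaussian focus
    prompt = (
        "Answer the question using the document image. "
        "The brightened region is the most relevant area.\n"
        f"Question: {query}\nAnswer:"
    )
    return call_mllm(focused, prompt)                 # step 3: no retraining needed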

What makes this human-like?

Spot-IT lets models zoom in like a human—keeping the page in view but staring at just the vital spot.

Qualitative Examples

NiM Spot-IT: Restaurant Menu

Restaurant Menu Analysis

Spot-IT successfully identifies and focuses on specific menu items and prices within complex restaurant menu layouts. The system can precisely locate fine details like individual dish prices, ingredients, and dietary information that occupy small regions of the overall document.

NiM Spot-IT: Lecture Screenshot

Lecture Screenshot Extraction

Academic content presents unique challenges with dense information layouts. Spot-IT demonstrates its ability to extract fine-grained details from lecture slides, focusing on specific data points, formulas, or textual elements that would be difficult to identify without targeted attention mechanisms.

NiM Spot-IT: Website Screenshot

Website UI Element Focus

Web interfaces contain numerous small interactive elements and textual details. This example showcases how Spot-IT can pinpoint specific UI components, navigation elements, or content sections within complex website layouts, enabling precise information extraction from digital interfaces.

Results & Analysis

Spot-IT was tested on open-source and leading commercial MLLMs. Highlights:

Model | NiM EM (%) | NiM F1 (%) | Human EM (%)
GPT-4o | 38 | 48 | 63
GPT-4o + Spot-IT | 46 | 56 | -
Gemini-1.5-Flash | 22 | 28 | -

Bottom line: Spot-IT is a clear step forward for fine-grained DocVQA, but there is still plenty of room to close the gap with human readers.
How to Cite
@inproceedings{thakkar-etal-2025-finding,
    title = "Finding Needles in Images: Can Multi-modal {LLM}s Locate Fine Details?",
    author = "Thakkar, Parth  and
      Agarwal, Ankush  and
      Kasu, Prasad  and
      Bansal, Pulkit  and
      Devaguptapu, Chaitanya",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.1152/",
    pages = "23626--23648",
    ISBN = "979-8-89176-251-0"
}