Finding Needles in Images: Can Multi-modal LLMs Locate Fine Details?

A showcase of the NiM-Benchmark and Spot-IT for fine-grained document AI

Parth Thakkar1,*, Ankush Agarwal1,*, Prasad Kasu1, Pulkit Bansal1,2,†, Chaitanya Devaguptapu1
1Fujitsu Research India    2Indian Institute of Technology Patna
*Equal contribution    †Work partially done at Fujitsu Research India

Abstract

Multi-modal Large Language Models (MLLMs) can read entire documents, but can they find hidden, precise details buried in complex layouts? NiM-Benchmark introduces tough new VQA challenges that demand fine-grained, localized visual reasoning, like spotting a menu price or a single value in a research-paper table. Spot-IT delivers an intuitive approach that helps MLLMs "zoom in": it intelligently picks image patches and applies a dynamic Gaussian focus guided by the user's query. Large-scale evaluation shows consistent accuracy gains from Spot-IT across models and reveals just how far MLLMs still lag behind human readers, especially for tiny, context-heavy details.

Introduction

Vision-Language Models are bridging the gap between seeing and understanding, but their real-world utility hinges on precision: can they find just the answer, no more and no less, when it occupies only a sliver of a dense document? Most DocVQA benchmarks miss this, focusing on broad summaries or generic extraction rather than the real-world pain point of hidden, context-rich queries.

Why does this matter?

Visual document QA fails if models can't locate tiny, vital details, the same details users hunt for when flipping through legal contracts, academic papers, or digital receipts.

NiM-Benchmark: Fine-Grained Challenges

NiM-Benchmark is crafted for precision stress-testing, with 2,970 images and 1,180 question-answer pairs drawn from everyday sources:

Each question targets a region covering less than 5% of the visual area. Domain variety and question complexity are ensured (see Table 9 of the paper).

Domain | Sample Types
Restaurant Menus | Prices, dish ingredients, dietary info
Academic Papers | Figure/table values, technical parameters
Magazines / Newspapers | Dates, events, entities, scores
Websites / Lectures | UI details, content, navigation elements

Spot-IT: Human-Like Visual Focus for MLLMs

1. Query-Guided Patch Identification

  • Divide the image into an n × n grid of patches (n = 6).
  • A vision-language encoder (SigLIP) scores each patch against the query using cosine similarity.
  • The patch with the highest similarity becomes the focus region (see the sketch below).
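
To make this step concrete, here is a minimal sketch of query-guided patch scoring built on a Hugging Face SigLIP checkpoint. The checkpoint name, the grid helper, and the scoring code are illustrative assumptions; the paper only specifies a SigLIP-style matcher over a 6 × 6 patch grid.

from PIL import Image
import torch
from transformers import AutoProcessor, AutoModel

# Illustrative checkpoint; Spot-IT's exact SigLIP variant may differ.
MODEL_ID = "google/siglip-base-patch16-224"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def split_into_patches(image, n=6):
    """Divide the page image into an n x n grid of equal-sized patches."""
    w, h = image.size
    pw, ph = w // n, h // n
    patches, boxes = [], []
    for row in range(n):
        for col in range(n):
            box = (col * pw, row * ph, (col + 1) * pw, (row + 1) * ph)
            patches.append(image.crop(box))
            boxes.append(box)
    return patches, boxes

@torch.no_grad()
def best_patch(image, query, n=6):
    """Return the patch box most similar to the query, plus its cosine score."""
    patches, boxes = split_into_patches(image, n)
    img_inputs = processor(images=patches, return_tensors="pt")
    txt_inputs = processor(text=[query], padding="max_length", return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
    txt_emb = model.get_text_features(**txt_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(-1)  # cosine similarity per patch
    idx = int(sims.argmax())
    return boxes[idx], float(sims[idx])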

2. Dynamic Gaussian Attention

  • Apply a Gaussian mask centered on the selected patch, with its width set by the match confidence.
  • Strong matches get a sharper focus; weak matches get a softer, wider one.
  • Blend the highlighted region into the original image and pass the result to the MLLM (see the sketch below).
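
The following is a minimal sketch of the dynamic Gaussian focus under stated assumptions: the mapping from confidence to mask width (base_sigma, min_sigma) and the dimming blend (dim_factor) are illustrative constants, not values from the paper, which describes the idea of sharper focus for stronger matches while keeping the full page visible.

import numpy as np
from PIL import Image

def gaussian_focus(image, box, confidence,
                   base_sigma=0.35, min_sigma=0.10, dim_factor=0.45):
    """Blend a Gaussian spotlight centered on `box` into the page image.

    `confidence` is assumed to be normalized to [0, 1]; higher confidence
    gives a smaller sigma (sharper spotlight). The rest of the page is dimmed
    rather than removed, so global context stays visible to the MLLM.
    """
    image = image.convert("RGB")
    w, h = image.size
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0

    # Width of the spotlight as a fraction of the longer image side.
    sigma = max(min_sigma, base_sigma * (1.0 - confidence)) * max(w, h)

    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

    img = np.asarray(image).astype(np.float32)
    # Full brightness inside the focus, progressively dimmer outside it.
    weight = dim_factor + (1.0 - dim_factor) * mask[..., None]
    return Image.fromarray(np.clip(img * weight, 0, 255).astype(np.uint8))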

3. Answer Generation

  • Use a standard DocVQA prompt, now paired with the focus-enhanced image (full pipeline sketched below).
  • No model retraining is required.
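
Putting the pieces together, an end-to-end sketch reusing the helpers above could look like the following. call_mllm is a placeholder for whichever MLLM interface is used (GPT-4o, Gemini, or an open model), and the prompt wording is illustrative rather than the exact prompt from the paper.

def answer_question(image, query, call_mllm):
    """Spot-IT-style pipeline sketch: locate, focus, then ask the MLLM."""
    box, sim = best_patch(image, query)               # step 1: query-guided patch
    confidence = max(0.0, min(1.0, sim))              # crude clamp to [0, 1]
    focused = gaussian_focus(image, box, confidence)  # step 2: dynamic Gaussian focus
    prompt = (
        "Answer the question using the document image. "
        "The brightened region is the most relevant area.\n"
        f"Question: {query}\nAnswer:"
    )
    return call_mllm(focused, prompt)                 # step 3: no retraining needed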

What makes this human-like?

Spot-IT lets models zoom in like a human—keeping the page in view but staring at just the vital spot.

Qualitative Examples

NiM Spot-IT: Restaurant Menu

Restaurant Menu Analysis

Spot-IT successfully identifies and focuses on specific menu items and prices within complex restaurant menu layouts. The system can precisely locate fine details like individual dish prices, ingredients, and dietary information that occupy small regions of the overall document.

NiM Spot-IT: Lecture Screenshot

Lecture Screenshot Extraction

Academic content presents unique challenges with dense information layouts. Spot-IT demonstrates its ability to extract fine-grained details from lecture slides, focusing on specific data points, formulas, or textual elements that would be difficult to identify without targeted attention mechanisms.

NiM Spot-IT: Website Screenshot

Website UI Element Focus

Web interfaces contain numerous small interactive elements and textual details. This example showcases how Spot-IT can pinpoint specific UI components, navigation elements, or content sections within complex website layouts, enabling precise information extraction from digital interfaces.

Results & Analysis

Spot-IT was tested on open-source and leading commercial MLLMs. Highlights:

Model | NiM EM (%) | NiM F1 (%) | Human EM (%)
GPT-4o | 38 | 48 | 63
GPT-4o + Spot-IT | 46 | 56 | -
Gemini-1.5-Flash | 22 | 28 | -

Bottom line: Spot-IT is a clear step forward for fine-grained DocVQA, but there is still plenty of room to close the gap with human readers.
How to Cite
@inproceedings{thakkar-etal-2025-finding,
    title = "Finding Needles in Images: Can Multi-modal {LLM}s Locate Fine Details?",
    author = "Thakkar, Parth  and
      Agarwal, Ankush  and
      Kasu, Prasad  and
      Bansal, Pulkit  and
      Devaguptapu, Chaitanya",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.1152/",
    pages = "23626--23648",
    ISBN = "979-8-89176-251-0"
}