BRIT: Bidirectional Retrieval over Unified Image-Text Graph

A multi-modal RAG framework for enterprise-specific proprietary documents without fine-tuning

Ainulla Khan*, Yamada Moyuru*, Srinidhi Akella
Fujitsu Research India
*Equal contribution   

Abstract

Enterprise documents containing proprietary terminology and company-specific names often cause traditional embedding-based RAG systems to fail due to poor similarity matching with pre-trained models. BRIT addresses this challenge through a novel multi-modal RAG framework that constructs unified text-image graphs from documents and retrieves query-relevant subgraphs without requiring fine-tuning. By capturing both semantic and spatial relationships between textual and visual elements, BRIT enables effective cross-modal retrieval on enterprise-specific content. We introduce the MM-RAG test set to evaluate multi-modal question answering capabilities that require understanding complex text-image relationships.

Introduction
Comparison of retrieval strategies for multi-modal documents:

(a) Chunk-wise retrieval: extract chunks from documents and retrieve them by similarity to the query. This loses the relations between chunks and misses contents that cannot be identified directly from the query.

(b) Page-wise retrieval: convert pages into images and retrieve pages by similarity to the query. This loses the relations across pages and likewise misses contents that cannot be identified directly from the query.

(c) BRIT (Ours): construct a graph over the documents and perform bidirectional linked-context retrieval. This preserves multi-modal relationships and enables cross-modal retrieval.

Why BRIT matters for Enterprises

BRIT's graph-based approach eliminates the need for document-specific fine-tuning while enabling sophisticated cross-modal retrieval.
BRIT Architecture Overview
1. Document Processing: Extract text and images from multi-modal documents while maintaining their spatial and semantic relationships.

2. Graph Construction: Build a unified text-image graph, linking text nodes and image nodes based on captions, text-image similarity scores, and document layout.

3. Query-relevant Sub-graph Retrieval: Use Prize-Collecting Steiner Tree (PCST) optimization to retrieve query-relevant multi-modal subgraphs, where each node's relevance is its cosine similarity to the input query (see the sketch after this list).

4. Response Generation: Generate answers from the retrieved text-image context.
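
To make steps 2 and 3 concrete, here is a minimal sketch, assuming precomputed node embeddings (e.g., from a CLIP-style encoder) and approximating the PCST step by connecting the highest-prize nodes with networkx's Steiner-tree heuristic. The function names, edge weights, and thresholds (build_graph, retrieve_subgraph, sim_threshold) are illustrative and not taken from the BRIT implementation.

```python
# Sketch: unified text-image graph construction and query-relevant subgraph retrieval.
# Assumptions: node embeddings are precomputed numpy vectors; the true PCST step is
# approximated here by connecting the top-prize nodes with networkx's Steiner-tree heuristic.
import numpy as np
import networkx as nx
from networkx.algorithms.approximation import steiner_tree


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def build_graph(text_chunks, images, sim_threshold=0.3):
    """text_chunks / images: lists of dicts with 'id', 'emb', 'page', and optionally
    'caption_of' (the id of the image that a text chunk captions)."""
    g = nx.Graph()
    for t in text_chunks:
        g.add_node(t["id"], kind="text", emb=t["emb"], page=t["page"])
    for im in images:
        g.add_node(im["id"], kind="image", emb=im["emb"], page=im["page"])
    for t in text_chunks:
        for im in images:
            # Link text and image nodes by explicit caption, embedding similarity, or layout.
            if t.get("caption_of") == im["id"]:
                g.add_edge(t["id"], im["id"], weight=0.1, rel="caption")
            elif cosine(t["emb"], im["emb"]) >= sim_threshold:
                g.add_edge(t["id"], im["id"], weight=0.5, rel="similarity")
            elif t["page"] == im["page"]:
                g.add_edge(t["id"], im["id"], weight=1.0, rel="layout")
    return g


def retrieve_subgraph(g, query_emb, top_k=5):
    # Node prize = cosine similarity between the node embedding and the query embedding.
    prizes = {n: cosine(g.nodes[n]["emb"], query_emb) for n in g.nodes}
    ranked = sorted(prizes, key=prizes.get, reverse=True)
    # Restrict terminals to one connected component so the connecting tree is well defined.
    component = nx.node_connected_component(g, ranked[0])
    terminals = [n for n in ranked if n in component][:top_k]
    if len(terminals) == 1:
        return g.subgraph(terminals).copy()
    return steiner_tree(g, terminals, weight="weight")
```

Caption edges get the lowest weight so the connecting tree prefers explicit text-image links over looser similarity or layout links; the full framework solves the prize-collecting objective directly rather than this top-k approximation, and this sketch omits text-text edges for brevity.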
MM-RAG Benchmark

We introduce the MM-RAG benchmark, comprising 500 complex questions that require cross-modal, multi-hop retrieval to identify the key information. The benchmark contains three types of questions: Image-Image, Text-Image, and Image-Text.

Qualitative Examples

Text-Image Question

The baseline retrieves images similar to the query but fails to retrieve the correct one. BRIT first finds the query-relevant texts and then retrieves the images connected to them to reach the answer.


Image-Text Question

Both the baseline and BRIT retrieve the relevant image, but the baseline cannot find the relevant texts because the key phrase 'Irene Dalton' does not appear in the question. BRIT finds the relevant image first and then discovers the relevant texts by following the link between the image and the text.
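
Both examples follow the same bidirectional pattern: seed the retrieval in one modality, then follow text-image edges to the other. Below is a minimal sketch, assuming the graph schema from the architecture sketch above (nodes with 'kind' and 'emb' attributes); expand_linked_context is a hypothetical helper, not part of the released code.

```python
# Sketch: bidirectional linked-context expansion over the unified text-image graph.
import numpy as np
import networkx as nx


def expand_linked_context(g: nx.Graph, query_emb: np.ndarray, start_kind="text", top_k=3):
    """Seed retrieval with the top-k nodes of one modality (by cosine similarity to the
    query), then follow text-image edges to collect linked nodes of the other modality."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    seeds = sorted(
        (n for n, d in g.nodes(data=True) if d["kind"] == start_kind),
        key=lambda n: cos(g.nodes[n]["emb"], query_emb),
        reverse=True,
    )[:top_k]
    other = "image" if start_kind == "text" else "text"
    linked = {nb for s in seeds for nb in g.neighbors(s) if g.nodes[nb]["kind"] == other}
    return seeds, sorted(linked)


# Text-Image question: start from query-relevant texts, follow links to the answer image.
# Image-Text question: start from the relevant image, follow links to texts whose keywords
# (e.g., 'Irene Dalton') never appear in the query.
```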

Results & Analysis

BRIT outperforms the baselines in overall QA accuracy on complex questions:

Method        Image-Image    Text-Image    Image-Text
CLIP          0.81           0.70          0.76
BLIP          0.82           0.68          0.78
BRIT (Ours)   0.81           0.89          0.81

Scores are QA accuracy on a 0-1 scale.