Enterprise documents containing proprietary terminology and company-specific names often cause traditional embedding-based RAG systems to fail due to poor similarity matching with pre-trained models. BRIT addresses this challenge through a novel multi-modal RAG framework that constructs unified text-image graphs from documents and retrieves query-relevant subgraphs without requiring fine-tuning. By capturing both semantic and spatial relationships between textual and visual elements, BRIT enables effective cross-modal retrieval on enterprise-specific content. We introduce the MM-RAG test set to evaluate multi-modal question answering capabilities that require understanding complex text-image relationships.
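The pipeline described above (build a unified text-image graph with semantic and spatial edges, then retrieve a query-relevant subgraph) can be sketched as follows. This is a minimal illustration under stated assumptions, not BRIT's actual construction or retrieval procedure: the toy `embed()` function, the similarity and adjacency thresholds, and the document elements are all made up for the example, and a real system would plug in a pre-trained multi-modal encoder for the embeddings.

```python
import re
import numpy as np
import networkx as nx

def embed(text: str) -> np.ndarray:
    """Toy embedding: a hashed bag of words, so the sketch runs without any
    model downloads. A real system would use a pre-trained encoder here."""
    vec = np.zeros(64)
    for tok in re.findall(r"[a-z0-9%]+", text.lower()):
        vec[sum(ord(c) for c in tok) % 64] += 1.0
    n = np.linalg.norm(vec)
    return vec / n if n else vec

def cos(a, b):
    return float(a @ b)

# Document elements: text chunks and an image (represented by its caption),
# each with a page bounding box (x0, y0, x1, y1).
elements = [
    {"id": "t1", "kind": "text",  "content": "Quarterly revenue rose 12%.",       "bbox": (0, 0, 4, 1)},
    {"id": "t2", "kind": "text",  "content": "Figure 1 shows revenue by region.", "bbox": (0, 1, 4, 2)},
    {"id": "i1", "kind": "image", "content": "bar chart of revenue by region",    "bbox": (0, 2, 4, 5)},
]

# Build the unified text-image graph.
G = nx.Graph()
for e in elements:
    G.add_node(e["id"], **e, emb=embed(e["content"]))

for a in elements:
    for b in elements:
        if a["id"] >= b["id"]:
            continue
        # Semantic edge: the two elements are similar enough in embedding space.
        if cos(G.nodes[a["id"]]["emb"], G.nodes[b["id"]]["emb"]) > 0.3:
            G.add_edge(a["id"], b["id"], type="semantic")
        # Spatial edge: the two elements are vertically adjacent on the page.
        elif abs(a["bbox"][3] - b["bbox"][1]) < 0.5:
            G.add_edge(a["id"], b["id"], type="spatial")

def retrieve_subgraph(graph, query, k=1, hops=1):
    """Seed with the k nodes most similar to the query, then expand the seed
    set along graph edges to pull in cross-modal neighbours."""
    q = embed(query)
    seeds = sorted(graph.nodes, key=lambda n: cos(q, graph.nodes[n]["emb"]), reverse=True)[:k]
    keep = set(seeds)
    for _ in range(hops):
        keep |= {m for n in list(keep) for m in graph.neighbors(n)}
    return graph.subgraph(keep)

sub = retrieve_subgraph(G, "Which region had the highest revenue?")
print(sorted(sub.nodes))  # ['i1', 't1', 't2'] -- the chart image is pulled in via its links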
The MM-RAG benchmark comprises 500 complex questions that require cross-modal, multi-hop retrieval to identify the key information. It contains three types of questions, corresponding to the categories reported in the results table below: image-image, text-image, and image-text.
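To make the evaluation concrete, the following is a minimal, hypothetical scoring loop for a benchmark organised this way. The record layout, field names, and exact-match scoring are illustrative assumptions, not the official MM-RAG protocol.

```python
from collections import defaultdict

def per_type_accuracy(records, answer_fn):
    """records: dicts with 'type' (e.g. 'text-image'), 'question', 'answer'.
    answer_fn: the RAG system under test, mapping a question to an answer."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["type"]] += 1
        if answer_fn(r["question"]).strip().lower() == r["answer"].strip().lower():
            correct[r["type"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Example with two toy records and a trivial system that always answers "figure 2".
toy = [
    {"type": "text-image", "question": "Which chart shows Q3 revenue?",  "answer": "figure 2"},
    {"type": "image-text", "question": "Who is pictured in figure 5?",   "answer": "irene dalton"},
]
print(per_type_accuracy(toy, lambda q: "figure 2"))  # {'text-image': 1.0, 'image-text': 0.0}
```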
The baseline retrieves images similar to the query, but the correct image cannot be retrieved this way. BRIT instead finds the query-relevant texts first and then retrieves the connected images to reach the answer.
Both the baseline and BRIT retrieve the relevant image; however, the baseline cannot find the relevant texts because an important keyword, 'Irene Dalton', does not appear in the question. BRIT finds the relevant image first and then discovers the relevant texts by following the link between the image and the text.
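These two case studies are the two directions of the same cross-modal hop: from retrieved texts to their linked images, and from a retrieved image back to its linked texts. The standalone sketch below illustrates only that hop pattern on a toy graph; the node names, contents, and edge type are made up for the example and are not BRIT's internal representation.

```python
import networkx as nx

# Toy unified graph: one text node linked to one image node.
G = nx.Graph()
G.add_node("t_bio",  kind="text",  content="Irene Dalton led the 2021 expansion.")
G.add_node("i_team", kind="image", content="photo of the leadership team")
G.add_edge("t_bio", "i_team", type="caption-link")

def cross_modal_hop(graph, seed_ids, target_kind):
    """From already-retrieved seed nodes, follow graph edges and keep only
    neighbours of the other modality (the second hop of the retrieval)."""
    found = set()
    for n in seed_ids:
        found |= {m for m in graph.neighbors(n)
                  if graph.nodes[m]["kind"] == target_kind}
    return found

# Case 1 (text first): query-relevant text leads to the connected image.
print(cross_modal_hop(G, {"t_bio"}, target_kind="image"))   # {'i_team'}

# Case 2 (image first): the retrieved image leads to text containing a
# keyword ('Irene Dalton') that never appears in the question itself.
print(cross_modal_hop(G, {"i_team"}, target_kind="text"))   # {'t_bio'}
```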
BRIT outperforms the baselines in overall QA accuracy on complex questions:
| Method | Image-Image | Text-Image | Image-Text |
|---|---|---|---|
| CLIP | 0.81 | 0.70 | 0.76 |
| BLIP | 0.82 | 0.68 | 0.78 |
| BRIT (Ours) | 0.81 | 0.89 | 0.81 |