Enterprise documents containing proprietary terminology and company-specific names often cause traditional embedding-based RAG systems to fail due to poor similarity matching with pre-trained models. BRIT addresses this challenge through a novel multi-modal RAG framework that constructs unified text-image graphs from documents and retrieves query-relevant subgraphs without requiring fine-tuning. By capturing both semantic and spatial relationships between textual and visual elements, BRIT enables effective cross-modal retrieval on enterprise-specific content. We introduce the MM-RAG test set to evaluate multi-modal question answering capabilities that require understanding complex text-image relationships.
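The pipeline described above (build a unified text-image graph with semantic and spatial edges, then retrieve a query-relevant subgraph) can be sketched as follows. This is a minimal illustration under stated assumptions, not BRIT's actual construction or retrieval procedure: the toy `embed()` function, the similarity and adjacency thresholds, and the document elements are all made up for the example, and a real system would plug in a pre-trained multi-modal encoder for the embeddings.

```python
import re
import numpy as np
import networkx as nx

def embed(text: str) -> np.ndarray:
    """Toy embedding: a hashed bag of words, so the sketch runs without any
    model downloads. A real system would use a pre-trained encoder here."""
    vec = np.zeros(64)
    for tok in re.findall(r"[a-z0-9%]+", text.lower()):
        vec[sum(ord(c) for c in tok) % 64] += 1.0
    n = np.linalg.norm(vec)
    return vec / n if n else vec

def cos(a, b):
    return float(a @ b)

# Document elements: text chunks and an image (represented by its caption),
# each with a page bounding box (x0, y0, x1, y1).
elements = [
    {"id": "t1", "kind": "text",  "content": "Quarterly revenue rose 12%.",       "bbox": (0, 0, 4, 1)},
    {"id": "t2", "kind": "text",  "content": "Figure 1 shows revenue by region.", "bbox": (0, 1, 4, 2)},
    {"id": "i1", "kind": "image", "content": "bar chart of revenue by region",    "bbox": (0, 2, 4, 5)},
]

# Build the unified text-image graph.
G = nx.Graph()
for e in elements:
    G.add_node(e["id"], **e, emb=embed(e["content"]))

for a in elements:
    for b in elements:
        if a["id"] >= b["id"]:
            continue
        # Semantic edge: the two elements are similar enough in embedding space.
        if cos(G.nodes[a["id"]]["emb"], G.nodes[b["id"]]["emb"]) > 0.3:
            G.add_edge(a["id"], b["id"], type="semantic")
        # Spatial edge: the two elements are vertically adjacent on the page.
        elif abs(a["bbox"][3] - b["bbox"][1]) < 0.5:
            G.add_edge(a["id"], b["id"], type="spatial")

def retrieve_subgraph(graph, query, k=1, hops=1):
    """Seed with the k nodes most similar to the query, then expand the seed
    set along graph edges to pull in cross-modal neighbours."""
    q = embed(query)
    seeds = sorted(graph.nodes, key=lambda n: cos(q, graph.nodes[n]["emb"]), reverse=True)[:k]
    keep = set(seeds)
    for _ in range(hops):
        keep |= {m for n in list(keep) for m in graph.neighbors(n)}
    return graph.subgraph(keep)

sub = retrieve_subgraph(G, "Which region had the highest revenue?")
print(sorted(sub.nodes))  # ['i1', 't1', 't2'] -- the chart image is pulled in via its links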
The MM-RAG benchmark comprises 500 complex questions that require cross-modal, multi-hop retrieval to identify the key information. It contains three types of questions, corresponding to the categories reported in the results table below: image-image, text-image, and image-text.
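To make the evaluation concrete, the following is a minimal, hypothetical scoring loop for a benchmark organised this way. The record layout, field names, and exact-match scoring are illustrative assumptions, not the official MM-RAG protocol.

```python
from collections import defaultdict

def per_type_accuracy(records, answer_fn):
    """records: dicts with 'type' (e.g. 'text-image'), 'question', 'answer'.
    answer_fn: the RAG system under test, mapping a question to an answer."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["type"]] += 1
        if answer_fn(r["question"]).strip().lower() == r["answer"].strip().lower():
            correct[r["type"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Example with two toy records and a trivial system that always answers "figure 2".
toy = [
    {"type": "text-image", "question": "Which chart shows Q3 revenue?",  "answer": "figure 2"},
    {"type": "image-text", "question": "Who is pictured in figure 5?",   "answer": "irene dalton"},
]
print(per_type_accuracy(toy, lambda q: "figure 2"))  # {'text-image': 1.0, 'image-text': 0.0}
```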
The baseline retrieves images similar to the query, but the correct image cannot be retrieved this way. BRIT instead finds the query-relevant texts first and then retrieves the connected images to reach the answer.
Both the baseline and BRIT retrieve the relevant image; however, the baseline cannot find the relevant texts because an important keyword, 'Irene Dalton', does not appear in the question. BRIT finds the relevant image first and then discovers the relevant texts by following the link between the image and the text.
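These two case studies are the two directions of the same cross-modal hop: from retrieved texts to their linked images, and from a retrieved image back to its linked texts. The standalone sketch below illustrates only that hop pattern on a toy graph; the node names, contents, and edge type are made up for the example and are not BRIT's internal representation.

```python
import networkx as nx

# Toy unified graph: one text node linked to one image node.
G = nx.Graph()
G.add_node("t_bio",  kind="text",  content="Irene Dalton led the 2021 expansion.")
G.add_node("i_team", kind="image", content="photo of the leadership team")
G.add_edge("t_bio", "i_team", type="caption-link")

def cross_modal_hop(graph, seed_ids, target_kind):
    """From already-retrieved seed nodes, follow graph edges and keep only
    neighbours of the other modality (the second hop of the retrieval)."""
    found = set()
    for n in seed_ids:
        found |= {m for m in graph.neighbors(n)
                  if graph.nodes[m]["kind"] == target_kind}
    return found

# Case 1 (text first): query-relevant text leads to the connected image.
print(cross_modal_hop(G, {"t_bio"}, target_kind="image"))   # {'i_team'}

# Case 2 (image first): the retrieved image leads to text containing a
# keyword ('Irene Dalton') that never appears in the question itself.
print(cross_modal_hop(G, {"i_team"}, target_kind="text"))   # {'t_bio'}
```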
BRIT outperforms the baselines in overall QA accuracy on complex questions:
| Method | Image-Image | Text-Image | Image-Text |
|---|---|---|---|
| CLIP | 0.81 | 0.70 | 0.76 |
| BLIP | 0.82 | 0.68 | 0.78 |
| BRIT (Ours) | 0.81 | 0.89 | 0.81 |