OCR-Reasoning

Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

Mingxin Huang1, Yongxin Shi1, Dezhi Peng2*, Songxuan Lai2, Zecheng Xie2, Lianwen Jin1*

1 South China University of Technology, 2 Huawei Cloud

🔔News

🔥[2025-05-24] The evaluation of OCR-Reasoning is now supported in VLMEvalKit.🚀

🔥[2025-05-22] Our paper is now available on arXiv.🚀

🔥[2025-05-18] Introducing OCR-Reasoning. The data and evaluation script are released.🚀

Introduction

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models (MLLMs) on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that annotate only the final answers, OCR-Reasoning also annotates the reasoning process. With both the annotated reasoning processes and the final answers, OCR-Reasoning evaluates not only the answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results reveal the limitations of existing methodologies: even state-of-the-art MLLMs struggle, with none surpassing 50% accuracy on OCR-Reasoning, indicating that text-rich image reasoning remains an urgent challenge.

OCR-Reasoning Benchmark

Overview

We introduce OCR-Reasoning, a novel benchmark specifically designed to evaluate the text-rich image reasoning skills of Multimodal Large Language Models (MLLMs). Specifically, OCR-Reasoning comprises 1,069 meticulously collected, human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks commonly encountered in text-rich visual contexts. Furthermore, unlike other text-rich image understanding benchmarks that annotate only the final answers, OCR-Reasoning simultaneously annotates the reasoning process and the answers. This comprehensive annotation enables deeper insights into the problem-solving strategies employed by state-of-the-art models.
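To make the dual annotation concrete, the sketch below shows what an annotated example might look like as a Python record. The field names (image_path, question, reasoning, answer, ability, task) and the sample values are illustrative assumptions, not the released data schema.

from dataclasses import dataclass

@dataclass
class OCRReasoningExample:
    """One hypothetical OCR-Reasoning record; field names are illustrative,
    not the official JSON schema."""
    image_path: str  # text-rich image (receipt, chart, poster, ...)
    question: str    # question that requires reasoning over the text in the image
    reasoning: str   # human-annotated step-by-step reasoning process
    answer: str      # human-annotated final answer
    ability: str     # one of the 6 core reasoning abilities
    task: str        # one of the 18 practical reasoning tasks

example = OCRReasoningExample(
    image_path="images/receipt_001.jpg",       # hypothetical file
    question="If the customer pays with a $50 bill, how much change should they receive?",
    reasoning="The receipt total is $37.85. Change = 50.00 - 37.85 = 12.15.",
    answer="$12.15",
    ability="Numerical Analysis",
    task="Receipt understanding",               # illustrative task name
)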


OCR-Reasoning is designed to measure six reasoning skills of MLLMs: Spatial Reasoning, Numerical Analysis, Logical Reasoning, Mathematical Reasoning, Multidisciplinary Knowledge, and Enumerative Reasoning.


Comparisons with Existing Benchmarks

A simple comparison with existing datasets shows that, in most cases, their answers appear directly in the images, whereas our benchmark contains very few such samples. In OCR-Reasoning, a model therefore needs to reason to obtain the answer rather than extract it from the OCR results of the image.
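One way to quantify this difference is to check whether the gold answer already appears verbatim in the OCR output of each image. The sketch below illustrates such a check; it assumes each example carries answer and ocr_text fields (from any off-the-shelf OCR engine) and uses simple string normalization, which is an approximation rather than the exact protocol behind the comparison above.

import re

def normalize(text: str) -> str:
    """Lowercase and collapse non-alphanumeric characters for a lenient string match."""
    return re.sub(r"[^0-9a-z]+", " ", text.lower()).strip()

def answer_in_ocr(answer: str, ocr_text: str) -> bool:
    """True if the gold answer appears verbatim in the OCR output of the image."""
    return normalize(answer) in normalize(ocr_text)

def extractive_ratio(dataset: list[dict]) -> float:
    """Fraction of examples whose answer can be read directly off the image.
    Each example is assumed to look like {"answer": ..., "ocr_text": ...} (illustrative)."""
    hits = sum(answer_in_ocr(ex["answer"], ex["ocr_text"]) for ex in dataset)
    return hits / max(len(dataset), 1)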


Experiment Results

Leaderboard on OCR-Reasoning (test)

Accuracy scores on the test subset (1,069 examples) of OCR-Reasoning.

# Model Method Source Date ALL SR NAR MR ER LR MKR
1 DouBao-1.5-Vision-Pro 🥇 MLLM 🖼️ Link 2025-05-22 46.8 27.5 54.0 33.3 50.8 34.7 58.4
2 OpenAI-o1 🥈 MLLM 🖼️ Link 2025-05-22 44.4 27.5 46.2 43.1 50.8 40.3 49.6
3 Gemini-2.0-Flash 🥉 MLLM 🖼️ Link 2025-05-22 39.3 19.3 47.2 24.5 49.7 36.8 32.1
4 Qwen2.5-VL-72B MLLM 🖼️ Link 2025-05-22 37.5 24.8 44.7 22.5 47.5 28.5 34.3
5 Qwen2.5-VL-32B MLLM 🖼️ Link 2025-05-22 36.2 21.1 38.7 25.5 46.9 34.7 36.5
6 Claude-3.7-Sonnet MLLM 🖼️ Link 2025-05-22 35.8 20.2 35.4 23.5 60.3 30.6 32.1
7 OpenAI-o3-mini Tool 🛠️ Link 2025-05-22 33.3 17.4 41.2 25.5 41.3 24.3 27.7
8 GPT-4o MLLM 🖼️ Link 2025-05-22 30.7 21.1 35.9 18.6 40.8 26.4 23.4
9 Llama4-Scout-109B-A17B MoE 🤖 Link 2025-05-22 27.7 15.6 34.7 16.7 41.3 22.9 12.4
10 DeepSeek-R1-Distill-Qwen-32B Tool 🛠️ Link 2025-05-22 26.5 11.9 28.9 23.5 34.6 18.8 30.7
11 Kimi-VL-A3B-Thinking MoE 🤖 Link 2025-05-22 20.5 11.9 22.4 14.7 24.6 21.5 19.7
12 InternVL3-78B MLLM 🖼️ Link 2025-05-22 19.9 13.8 22.4 9.8 14.0 27.1 25.5
13 InternVL3-32B MLLM 🖼️ Link 2025-05-22 17.1 14.7 10.3 14.7 24.0 11.8 37.2
14 Qwen2.5-VL-7B MLLM 🖼️ Link 2025-05-22 15.7 13.8 11.6 8.8 20.1 9.0 35.8
15 VL-Rethinker-7B MLLM 🖼️ Link 2025-05-22 14.6 8.3 16.1 9.8 19.6 8.3 19.0
16 VLAA-Thinker-Qwen2.5VL-7B MLLM 🖼️ Link 2025-05-22 14.4 11.9 10.3 7.8 21.2 11.8 27.0
17 MM-Eureka-Qwen-7B MLLM 🖼️ Link 2025-05-22 13.2 9.2 7.0 10.8 18.4 15.3 27.0
18 Qwen2.5-VL-3B MLLM 🖼️ Link 2025-05-22 12.2 11.0 11.8 9.8 19.0 7.6 11.7
19 InternVL3-8B MLLM 🖼️ Link 2025-05-22 11.5 12.8 5.8 11.8 17.9 7.6 22.6
20 InternVL3-2B MLLM 🖼️ Link 2025-05-22 10.8 11.9 4.8 7.8 18.4 11.8 18.3
Method types: MoE 🤖: Mixture of Experts, MLLM 🖼️: Multimodal Large Language Model, Tool 🛠️: Large Language Model with OCR. ALL: Overall Average Score.
Task types: SR: Spatial Reasoning, NAR: Numerical Analysis Reasoning, MR: Mathematical Reasoning, ER: Enumerative Reasoning, LR: Logical Reasoning, MKR: Multidisciplinary Knowledge Reasoning
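For reference, below is a minimal sketch of how the per-ability and overall accuracy scores in the table above could be aggregated from per-example results. The field names ("ability", "correct") and the micro-averaging rule for the ALL column are assumptions for illustration, not the official scoring script.

from collections import defaultdict

def accuracy_by_ability(results: list[dict]) -> dict[str, float]:
    """Aggregate per-example correctness into per-ability accuracy (in %).
    Each result is assumed to look like {"ability": "SR", "correct": True}."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in results:
        total[r["ability"]] += 1
        correct[r["ability"]] += int(r["correct"])
    scores = {a: 100.0 * correct[a] / total[a] for a in total}
    # "ALL" here is a micro-average over all examples; whether the leaderboard
    # uses a micro- or macro-average is not stated on this page.
    scores["ALL"] = 100.0 * sum(correct.values()) / sum(total.values())
    return scores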

Leaderboard for Reasoning Path Scores

Scores on the reasoning paths of OCR-Reasoning.

# Model Method Source Date ALL SR NAR MR ER LR MKR
1 DouBao-1.5-Vision-Pro 🥇 MLLM 🖼️ Link 2025-05-22 55.4 38.2 61.8 50.2 52.4 52.8 61.2
2 Claude-3.7-Sonnet 🥈 MLLM 🖼️ Link 2025-05-22 50.3 37.7 55.0 38.8 58.1 48.6 46.5
3 Gemini-2.0-Flash 🥉 MLLM 🖼️ Link 2025-05-22 49.5 31.5 57.1 42.6 49.3 47.4 49.2
4 OpenAI-o1 MLLM 🖼️ Link 2025-05-22 48.5 36.9 53.9 50.0 39.4 49.4 51.8
5 GPT-4o MLLM 🖼️ Link 2025-05-22 45.4 35.4 48.9 33.0 48.7 48.0 45.5
Method types: MoE 🤖: Mixture of Experts, MLLM 🖼️: Multimodal Large Language Model, Tool 🛠️: Large Language Model with OCR. ALL: Overall Average Score.
Task types: SR: Spatial Reasoning, NAR: Numerical Analysis Reasoning, MR: Mathematical Reasoning, ER: Enumerative Reasoning, LR: Logical Reasoning, MKR: Multidisciplinary Knowledge Reasoning
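This page does not spell out how the reasoning-path scores above are computed; a common recipe is to have a strong LLM grade a model's reasoning against the human-annotated reference. The sketch below assumes an OpenAI-compatible chat client and a hypothetical 0-100 grading prompt; it is illustrative, not the official judge.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a model's reasoning for a text-rich image question.
Question: {question}
Reference reasoning: {reference}
Model reasoning: {prediction}
Return only an integer score from 0 to 100 for how faithful and complete the model's reasoning is."""

def score_reasoning_path(question: str, reference: str, prediction: str,
                         judge_model: str = "gpt-4o") -> int:
    """Ask an LLM judge for a 0-100 score; the prompt and scale are illustrative assumptions."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
        temperature=0,
    )
    # Assumes the judge follows the instruction and returns a bare integer.
    return int(response.choices[0].message.content.strip())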

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

📌 Key Finding

1. The visual information in images is crucial for the OCR reasoning task. When we replace images with their OCR results and feed the text into LLMs, we observe that accuracy is relatively low. This indicates that text alone is insufficient for solving text-rich image reasoning tasks (a sketch of this text-only baseline follows these findings).


2. Existing models still have room for improvement in OCR reasoning tasks. Even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50% across OCR-Reasoning.


3. Existing reinforcement learning methods perform poorly on text-rich image reasoning tasks. Designing reinforcement learning approaches tailored to this setting is a promising direction for improving text-rich image reasoning capabilities.
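As referenced in finding 1, the text-only baseline replaces the image with OCR output before querying an LLM. A minimal sketch of such a pipeline is shown below; it uses pytesseract for OCR and an OpenAI-compatible chat client, both of which are stand-ins for whatever OCR engine and LLM are actually evaluated.

from PIL import Image
import pytesseract
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_from_ocr_only(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Text-only baseline: OCR the image, then ask an LLM using the extracted text alone."""
    ocr_text = pytesseract.image_to_string(Image.open(image_path))
    prompt = (f"The following text was extracted from an image by OCR:\n{ocr_text}\n\n"
              f"Question: {question}\nAnswer concisely.")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()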

Examples of Different Tasks

OCR-Reasoning not only annotates the final answers but also documents the reasoning steps taken to arrive at them. Specifically, OCR-Reasoning comprises 1,069 meticulously collected, human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks commonly encountered in text-rich visual contexts. Below we provide representative examples of these 18 tasks.


Error Examples

BibTeX


@article{huang2025ocreasoning,
  title={OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning},
  author={Mingxin Huang and Yongxin Shi and Dezhi Peng and Songxuan Lai and Zecheng Xie and Lianwen Jin},
  journal={arXiv preprint arXiv:2505.17163},
  year={2025},
}