TLDR
This paper proposes MathVerse, a multimodal large language model (MLLM) benchmark for evaluating the visual mathematical problem-solving capability of MLLMs. The authors also propose a Chain-of-Thought (CoT) evaluation strategy for fine-grained assessment of the reasoning process. Results show that MLLMs struggle to use the visual information in diagrams, and some even perform better without visual input.
Introduction
Multiple benchmarks have been proposed to evaluate the mathematical reasoning capability of MLLMs, such as GeoQA, UniGeo, MathVista, and MMMU. However, existing benchmarks have two problems:
- MLLMs may rely on the text input rather than the visual input to solve the problems.
- The evaluation process is a black box: we do not know at which step the MLLM makes a mistake.
The authors propose MathVerse to address these two problems.
Method
Data collection
MathVerse contains $2612$ visual math problems, divided into three subjects: plane geometry ($1746$), solid geometry ($332$), and functions ($534$).
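As a quick sanity check on these figures, the short sketch below confirms that the three subject counts add up to the stated total and derives each subject's share. The variable names are illustrative only; the sample count assumes every problem is expanded into the six versions described later in these notes.

```python
# Subject counts as quoted in this summary.
subject_counts = {
    "plane geometry": 1746,
    "solid geometry": 332,
    "functions": 534,
}

total = sum(subject_counts.values())

# Share of each subject in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in subject_counts.items()}

# Assumption: each problem is expanded into six versions.
num_versions = 6
total_samples = total * num_versions

print(total)          # 2612
print(total_samples)  # 15672
print(shares)
```

Plane geometry accounts for roughly two thirds of the problems, so overall scores are dominated by that subject.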
Text input processing
To check whether MLLMs actually use the visual input to solve the problems, the authors decompose the text of each problem into three categories:
- Descriptive Information: content that is directly observable and clearly portrayed in the diagram.
- Implicit Property: information that requires a higher level of visual perception, but little mathematical knowledge, to read from the diagram.
- Essential Condition: specific numerical or algebraic measurements.
The dataset is then augmented by creating six versions of each problem:
- Text-dominant version: all text input and the visual input are kept.
- Text-lite version: descriptive information is discarded.
- Text-only version: the visual input is discarded.
- Vision-intensive version: descriptive information and implicit property are discarded.
- Vision-dominant version: descriptive information and essential condition are discarded.
- Vision-only version: the text input is discarded.
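The six versions can be read as a lookup table stating which text categories survive and whether the diagram is shown. The following is a minimal sketch under stated assumptions: the `Problem` class, its field names, and `build_version` are hypothetical and are not the benchmark's actual schema or tooling. It also assumes that the question sentence itself is kept in every version except the vision-only one, which these notes do not state explicitly.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass
class Problem:
    """Hypothetical container for one annotated problem."""
    descriptive_info: str
    implicit_property: str
    essential_condition: str
    question: str
    diagram: str  # e.g. a path to the diagram image


CATEGORY_ORDER = ("descriptive_info", "implicit_property", "essential_condition")

# version name -> (text categories kept, whether the diagram is shown)
VERSIONS: Dict[str, Tuple[Set[str], bool]] = {
    "text_dominant": (set(CATEGORY_ORDER), True),
    "text_lite": ({"implicit_property", "essential_condition"}, True),
    "text_only": (set(CATEGORY_ORDER), False),
    "vision_intensive": ({"essential_condition"}, True),
    "vision_dominant": ({"implicit_property"}, True),
    "vision_only": (set(), True),
}


def build_version(problem: Problem, version: str) -> Dict[str, Optional[str]]:
    """Return the text and diagram that a model would see for one version."""
    kept, with_diagram = VERSIONS[version]
    parts = [getattr(problem, c) for c in CATEGORY_ORDER if c in kept]
    # Assumption: the question is kept unless all text is discarded.
    if version != "vision_only":
        parts.append(problem.question)
    return {
        "text": " ".join(p for p in parts if p),
        "diagram": problem.diagram if with_diagram else None,
    }


example = Problem(
    descriptive_info="AB is a chord of circle O.",
    implicit_property="OC is perpendicular to AB.",
    essential_condition="AB = 8 and OC = 3.",
    question="Find the radius of circle O.",
    diagram="circle.png",
)
for name in VERSIONS:
    print(name, build_version(example, name))
```

Keeping the rules in one table makes it easy to see that moving from text-dominant to vision-only steadily shifts information from the text to the diagram.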
CoT evaluation
To expose the reasoning process of an MLLM when it solves a mathematical problem, the authors employ GPT-4 to extract the key steps of the solution. A score is then given to each reasoning step, so that the intermediate reasoning, and not only the final answer, is evaluated.
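To illustrate how per-step scores can make the evaluation less of a black box, here is a minimal sketch. It assumes the step scores have already been produced by some judge and lie in $[0, 1]$. Plain averaging, the `0.5` threshold, and both function names are assumptions made for illustration; these notes do not specify how the paper aggregates step scores.

```python
from typing import List, Optional


def cot_score(step_scores: List[float]) -> float:
    """Average the per-step scores (assumed aggregation, not the paper's)."""
    if not step_scores:
        return 0.0
    for s in step_scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"step score out of range: {s}")
    return sum(step_scores) / len(step_scores)


def first_error_step(step_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the 0-based index of the first step scoring below the threshold."""
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None


# Hypothetical scores for a four-step solution that goes wrong at the third step.
scores = [1.0, 1.0, 0.0, 0.0]
print(cot_score(scores))         # 0.5
print(first_error_step(scores))  # 2
```

Unlike a binary right-or-wrong check on the final answer, this gives partial credit for correct early steps and points to where the reasoning first breaks down.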
Result
Experiment results show that:
- MLLMs rely more on the descriptive information in the text than on the diagrams.
- LLMs achieve results competitive with those of MLLMs.