GeoX is a multi-modal large model designed for automatic geometric problem solving, incorporating three progressive training stages to enhance diagram understanding and reasoning. In this paper, we validate that the formal vision-language training is a simple-yet-effective paradigm for complex mathematical diagram learning.
Abstract
Despite their proficiency in general tasks, Multi-modal Large Language Models (MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands understanding diagrams, interpreting symbols, and performing complex reasoning. This limitation arises from their pre-training on natural images and texts, along with the lack of automated verification in the problem-solving process. Besides, current geometric specialists are limited by their task-specific designs, making them less effective for broader geometric problems. To this end, we present GeoX, a multi-modal large model focusing on geometric understanding and reasoning tasks. Given the significant differences between geometric diagram-symbol and natural image-text, we introduce unimodal pre-training to develop a diagram encoder and symbol decoder, enhancing the understanding of geometric images and corpora. Furthermore, we introduce geometry-language alignment, an effective pre-training paradigm that bridges the modality gap between unimodal geometric experts. We propose a Generator-And-Sampler Transformer (GS-Former) to generate discriminative queries and eliminate uninformative representations from unevenly distributed geometric signals. Finally, GeoX benefits from visual instruction tuning, empowering it to take geometric images and questions as input and generate verifiable solutions. Experiments show that GeoX outperforms both generalists and geometric specialists on publicly recognized benchmarks, such as GeoQA, UniGeo, Geometry3K, and PGPS9k. Our data and code will be released soon to accelerate future research on automatic GPS.- [2024/12/30] Full version of the code and training scripts will be released within the next few days.
- [2024/10/17] Upload paper and init project. Release the data for GeoX. See here.
Environment Setup
Step 1. Build Dependencies. Our code is tested with Python 3.10.14. To run the codes, you should first install the following packages:
conda create -n geox python=3.10
conda activate geox
pip install --upgrade pip
pip install -r requirements.txt
pip install flash-attn==2.5.9.post1 --no-build-isolation
Data and Weights Preparation
Step 1. Download and Prepare Data.
- Follow the instructions here and download full dataset for GeoX.
- To train the model, you are required to organize the files into the following folders:
./data/
alignment/
images/
unified_formal_annotations.json
geoqa/
images/
geoqa_train.json
geoqa_test.json
unigeo/
images/
unigeo_train.json
unigeo_test.json
geometry3k/
images/
geometry3k_train.json
geometry3k_test.json
pgps9k/
images/
pgps9k_train.json
pgps9k_test.json
(Optional) Pretraining
# Pretrain Geo-VIT
BASE_DIR="/path/to/your/base/directory" # Modify this path as necessary
OUTPUT_DIR="/path/to/your/output/directory" # Modify this path as necessary
python ${BASE_DIR}/pretrain/pretrain_encoder.py \
--job_dir ${OUTPUT_DIR}/checkpoint/mae \
--nodes 1 \
--ngpus 8 \
--accum_iter 16 \
--batch_size 256 \
--use_volta32 \
--model mae_vit_base_patch16 \
--mask_ratio 0.75 \
--epochs 800 \
--warmup_epochs 40 \
--blr 1.5e-4 \
--weight_decay 0.05 \
--data_path ${BASE_DIR}/data # Ensure the data path is correctly parameterized
# Pretrain Geo-LLM
DATA_FILE="/path/to/your/training/data" # Modify this path as necessary
OUTPUT_DIR="/path/to/your/output/directory" # Modify this path as necessary
MODEL_DIR="/path/to/LLEMMA/directory" # Modify this path as necessary
LOG_FILE="${OUTPUT_DIR}/train.log"
if [ ! -d "${OUTPUT_DIR}" ]; then
mkdir -p "${OUTPUT_DIR}"
fi
GPU_DEVICES=""
deepspeed --include=localhost:${GPU_DEVICES} \
main/train_llm.py \
--config_name "${MODEL_DIR}/config.json" \
--tokenizer_name "${MODEL_DIR}" \
--model_name_or_path "${MODEL_DIR}" \
--train_files "${DATA_FILE}" \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 32 \
--do_train \
--output_dir "${OUTPUT_DIR}" \
--evaluation_strategy steps \
--use_fast_tokenizer false \
--max_eval_samples 0 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 4 \
--num_train_epochs 10 \
--warmup_ratio 0.1 \
--logging_dir "${OUTPUT_DIR}/logs" \
--logging_strategy steps \
--logging_steps 50 \
--save_strategy steps \
--preprocessing_num_workers 10 \
--save_steps 20000000 \
--eval_steps 500000000 \
--save_total_limit 2000 \
--seed 42 \
--disable_tqdm false \
--ddp_find_unused_parameters false \
--block_size 1024 \
--overwrite_output_dir \
--report_to tensorboard \
--run_name llm_pretrain \
--bf16 \
--bf16_full_eval \
--gradient_checkpointing \
--deepspeed configs/models/zero3.json \
--ignore_data_skip true \
--ddp_timeout 18000000 \
| tee -a "${LOG_FILE}"
Finetune on Geometry Data
MODEL_DIR="/path/to/your/model/directory" # Modify this path as necessary
Text_FILE="/path/to/your/training/data" # Modify this path as necessary
IMAGE_FOLDER="/path/to/your/image/folder" # Modify this path as necessary
OUTPUT_DIR="/path/to/your/output/directory" # Modify this path as necessary
LOG_FILE="${OUTPUT_DIR}/train.log"
GPU_DEVICES=""
if [ ! -d "${OUTPUT_DIR}" ]; then
mkdir -p "${OUTPUT_DIR}"
fi
export MASTER_PORT=20728
deepspeed --include=localhost:${GPU_DEVICES} --master_port=$MASTER_PORT main/train_geox.py \
--deepspeed ./configs/models/zero2.json \
--model_name_or_path "${MODEL_DIR}" \
--version geo_v1 \
--data_path "${Text_FILE}" \
--image_folder "${IMAGE_FOLDER}" \
--vision_tower "${MODEL_DIR}/geo-vit.pth" \
--gsformer_path "${MODEL_DIR}/gsformer.pth" \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir "${OUTPUT_DIR}" \
--num_train_epochs 100 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 100 \
--learning_rate 3e-5 \
--weight_decay 0. \
--warmup_ratio 0.05 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 0 \
--lazy_preprocess True \
| tee -a "${LOG_FILE}"
If you find our work helps, please consider starring ⭐ us and citing:
@misc{xia2024geoxgeometricproblemsolving,
title={GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training},
author={Renqiu Xia and Mingsheng Li and Hancheng Ye and Wenjie Wu and Hongbin Zhou and Jiakang Yuan and Tianshuo Peng and Xinyu Cai and Xiangchao Yan and Bin Wang and Conghui He and Botian Shi and Tao Chen and Junchi Yan and Bo Zhang},
year={2024},
eprint={2412.11863},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.11863},
}
Thanks to LLaVA, LAVIS, MAE, and trasnformers. We borrow some of their codes and checkpoints.
This code is distributed under an Apache-2.0 license. If there are any problems regarding our project, please open an issue.