Inside NuminaMath: The AI Model That Took First Place in the AI Math Olympiad

The model used strong data curation, fine-tuning processes, and algorithmic improvements to reach the top of the AIMO leaderboard.

Jesus Rodriguez
Published in Towards AI
7 min read · Jul 22, 2024



The AI Mathematical Olympiad (AIMO) has been one of the most interesting initiatives for evaluating sophisticated math reasoning in AI models. Launched a few months ago, AIMO set up a $10 million prize for models that can reason at the level of a gold medalist in the International Mathematical Olympiad (IMO), the competition for high school students. To perform at that level, AI models need to exhibit sophisticated capabilities in areas such as multi-step reasoning, mathematics, and deep language understanding. I was fascinated by the AIMO challenge and tracked the progress of the different models closely over the last few months, trying to understand the techniques they were using to solve such complex challenges.

After months of competition, NuminaMath 7B TIR emerged as the winner. The model was a collaboration between HuggingFace and Numina, a lab focused on advancing math capabilities in foundation models. You probably know a lot about HuggingFace but very little about Numina, so let’s fix that.

Numina is a lab dedicated to advancing math capabilities in foundation models. Numina rallies behind the vision that math is essential to humanity and a key component of advanced intelligence. The project received initial support from Mistral and firms like General Catalyst and set its sights on the AIMO challenge as one of its first major tests.

NuminaMath combines some obvious steps with novel approaches across several areas. Today, I would like to dive into some of the details behind NuminaMath that could serve as inspiration for AI teams working on similar problems.

NuminaMath

One of the most interesting aspects of NuminaMath is that the team did not build a new architecture from scratch. Instead, they relied on the DeepSeekMath model as a baseline and extended it with a novel approach based on three fundamental components:

i. Fine-tuning Strategy: NuminaMath fine-tuned the DeepSeekMath-Base 7B model to function as a “reasoning agent.” This agent tackled mathematical problems using natural language reasoning combined with a Python REPL to compute intermediate results.

ii. Decoding Algorithm: They developed a novel decoding algorithm for tool-integrated reasoning (TIR) that incorporated code execution feedback, enabling the generation of solution candidates during inference.

iii. Internal Validation Sets: Various internal validation sets were used to guide model selection and prevent overfitting to the public leaderboard.

The models were trained using open-source libraries such as TRL, PyTorch, vLLM, and DeepSpeed. Training on one node of 8 x H100 GPUs took approximately 10 hours.

Training Recipe

Fine-tuning is arguably one of the most interesting areas of contribution in NuminaMath.

The fine-tuning process was divided into two stages:

i. Stage 1: The base model was fine-tuned on a diverse dataset of natural language math problems and solutions. Each solution was templated with Chain of Thought (CoT) to aid reasoning.

ii. Stage 2: The model from Stage 1 was further fine-tuned on a synthetic dataset of tool-integrated reasoning. Problems were broken down into rationales, Python programs, and their outputs. This method, influenced by Microsoft’s ToRA paper, produced a reasoning agent capable of solving problems using both natural language and a Python REPL.
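To make the Stage 2 format more concrete, here is a minimal sketch of how one tool-integrated training sample could be rendered as text. The `TirStep` class, the `render_tir_sample` helper, the "Problem:" / "Final answer:" markers, and the python/output fence delimiters are all illustrative assumptions in the spirit of ToRA, not the team's actual template.

```python
from dataclasses import dataclass
from typing import List

FENCE = "`" * 3  # built programmatically so the example nests cleanly in a document


@dataclass
class TirStep:
    rationale: str  # natural-language reasoning for this step
    program: str    # Python code the model writes
    output: str     # what the interpreter printed when the code ran


def render_tir_sample(problem: str, steps: List[TirStep], answer: str) -> str:
    """Interleave rationale, code, and execution output into one training string."""
    parts = [f"Problem: {problem}"]
    for step in steps:
        parts.append(step.rationale)
        parts.append(f"{FENCE}python\n{step.program}\n{FENCE}")
        parts.append(f"{FENCE}output\n{step.output}\n{FENCE}")
    parts.append(f"Final answer: {answer}")
    return "\n".join(parts)


sample = render_tir_sample(
    problem="What is the sum of the first 10 positive integers?",
    steps=[TirStep("Sum the integers from 1 to 10 with Python.",
                   "print(sum(range(1, 11)))", "55")],
    answer="55",
)
print(sample)
```

A Stage 1 sample would look the same minus the code and output blocks: just the problem followed by a step-by-step rationale and the answer.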

Both stages involved “full fine-tuning,” where all model weights were updated during backpropagation. The “packing” feature from TRL’s SFTTrainer was utilized to concatenate multiple samples into a single chunk of 2048 tokens. Gradient checkpointing and the DeepSpeed ZeRO-3 protocol ensured efficient training within available VRAM. Key hyperparameters used in each stage included a learning rate of 2.0e-5, a total batch size of 32, and a cosine learning rate scheduler.
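Two of the ideas above, packing and the cosine schedule, are easy to illustrate in plain Python. This is a simplified sketch, not TRL's implementation: the `pack_sequences` and `cosine_lr` helpers are hypothetical, and the EOS separator, dropping the trailing remainder, and the absence of warmup are assumptions made to keep the example short.

```python
import math
from typing import Iterable, List


def pack_sequences(samples: Iterable[List[int]], chunk_size: int = 2048,
                   eos_id: int = 0) -> List[List[int]]:
    """Concatenate tokenized samples (each followed by EOS) and cut fixed-size chunks."""
    stream: List[int] = []
    for tokens in samples:
        stream.extend(tokens)
        stream.append(eos_id)
    # Drop the trailing remainder so every chunk is exactly chunk_size tokens long.
    n_full = len(stream) // chunk_size
    return [stream[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]


def cosine_lr(step: int, total_steps: int, peak_lr: float = 2e-5) -> float:
    """Cosine decay from peak_lr at step 0 down to 0 at total_steps (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Three samples of 1500 tokens each become two full 2048-token chunks.
chunks = pack_sequences([[1] * 1500, [2] * 1500, [3] * 1500])
print(len(chunks), [len(c) for c in chunks])
print(cosine_lr(0, 100), cosine_lr(50, 100), cosine_lr(100, 100))
```

Packing avoids wasting compute on padding tokens, which matters when most math solutions are far shorter than the 2048-token chunk.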

Initial Attempts and Adjustments

Initial submissions using only Stage 1 fine-tuning yielded limited success. Inspired by Abdur Rafae’s public prize notebook, NuminaMath integrated code execution into their training recipe. They first explored the Mix of Minimal Optimal Sets (MMOS) dataset but found it insufficient for harder problems. This led them to develop a dataset similar to the one used by DeepSeekMath Instruct / RL models, resulting in significant improvements.

Dataset Construction

NuminaMath used two main datasets for its fine-tuning process:

i. Chain of Thought Dataset: Comprised of several hundred thousand problems with solutions written in a Chain of Thought manner. Data sources ranged from Chinese high school math exercises to international mathematics competition problems. The data underwent OCR, segmentation, translation into English, and realignment to produce a Chain of Thought format.

ii. Tool-Integrated Reasoning Dataset: Focused on 60,000 problems from the Numina dataset with numerical outputs. Using a pipeline with GPT-4, they generated ToRA-like reasoning paths and executed code to produce results. Solutions were iteratively filtered and refined to ensure accuracy.
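The filtering step for the tool-integrated dataset can be pictured as "run the generated program and keep it only if the result matches the known answer." The sketch below assumes exactly that rule; the `run_program` and `keep_sample` helpers and the numeric tolerance are hypothetical, and the article does not describe the team's actual filter in this much detail.

```python
import contextlib
import io
from typing import Optional


def run_program(code: str) -> Optional[str]:
    """Run a code string and return its stdout, or None if it raised."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # NOTE: no sandboxing; a real pipeline must isolate this
    except Exception:
        return None
    return buffer.getvalue().strip()


def keep_sample(code: str, reference: float, tol: float = 1e-6) -> bool:
    """Keep a generated solution only if its last printed line matches the reference."""
    output = run_program(code)
    if not output:
        return False
    try:
        predicted = float(output.splitlines()[-1])
    except ValueError:
        return False
    return abs(predicted - reference) <= tol


candidates = [
    ("print(sum(range(1, 11)))", 55),  # correct
    ("print(sum(range(1, 10)))", 55),  # off by one, wrong answer
    ("print(1 / 0)", 55),              # raises at runtime
]
kept = [code for code, ref in candidates if keep_sample(code, ref)]
print(kept)
```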

SC-TIR Algorithm

To address high variance in model evaluation, NuminaMath developed SC-TIR, a self-consistency decoding algorithm for tool-integrated reasoning. This involved:

· Copying the input N times to define the initial batch of prompts.

· Sampling N diverse completions until a complete block of Python code was produced.

· Executing each Python block and concatenating the output.

· Repeating the process M times to allow self-correction of code errors.

· Postprocessing and applying majority voting to select the final answer.
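The steps above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the winning code: `generate` is a hypothetical callable standing in for sampling from the model, answers are assumed to be integers announced with a "Final answer:" marker, and code runs through an unsandboxed `exec`.

```python
import contextlib
import io
import re
from collections import Counter
from typing import Callable, List, Optional

FENCE = "`" * 3
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)
ANSWER_RE = re.compile(r"Final answer:\s*(-?\d+)")


def execute(code: str) -> str:
    """Run code and return stdout, or the error text so the model can self-correct."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # unsandboxed: for illustration only
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def sc_tir(problem: str, generate: Callable[[str], str], n: int, m: int) -> Optional[int]:
    prompts: List[str] = [problem] * n  # copy the input N times
    answers: List[int] = []
    for _ in range(m):  # up to M rounds of generate-then-execute
        still_running = []
        for prompt in prompts:
            prompt += generate(prompt)  # sample until a code block (or an answer)
            found = ANSWER_RE.search(prompt)
            if found:
                answers.append(int(found.group(1)))
                continue
            blocks = CODE_RE.findall(prompt)
            if blocks:  # run the latest block and append its output to the prompt
                prompt += f"\n{FENCE}output\n{execute(blocks[-1])}\n{FENCE}\n"
                still_running.append(prompt)
            # candidates with neither code nor an answer are dropped
        prompts = still_running
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]  # majority vote


def fake_generate(prompt: str) -> str:
    """Deterministic stand-in for a sampled completion (hypothetical)."""
    if FENCE + "output" in prompt:
        result = prompt.rsplit(FENCE + "output\n", 1)[1].split("\n", 1)[0]
        return f"Final answer: {result}\n"
    return f"\nLet's compute it.\n{FENCE}python\nprint(6 * 7)\n{FENCE}"


answer = sc_tir("What is 6 times 7?", fake_generate, n=4, m=2)
print(answer)
```

Because errors are fed back as text instead of aborting the candidate, a later round gives the model a chance to rewrite a failing program, which is the self-correction the M rounds are meant to allow.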

For their winning submission, they generated N=48 candidates with a depth of M=4. Quantizing models to 8-bit precision improved upload speed and accommodated GPU constraints without significantly compromising accuracy.
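A quick back-of-the-envelope calculation shows why 8-bit quantization helps with upload size and GPU memory. The sketch below counts weights only and treats "7B" as exactly 7 billion parameters, which is an assumption; activations, the KV cache, and any quantization metadata are ignored.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


n_params = 7e9  # nominal "7B"; the exact parameter count is an assumption
fp16_gib = weight_memory_gib(n_params, 16)
int8_gib = weight_memory_gib(n_params, 8)
print(f"16-bit: {fp16_gib:.1f} GiB, 8-bit: {int8_gib:.1f} GiB")
```

Under these assumptions the weights shrink from roughly 13 GiB to about 6.5 GiB, which leaves more memory for sampling many candidates in parallel.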

Avoiding Overfitting

To mitigate overfitting to the public leaderboard, NuminaMath used four internal validation sets, covering problems of varying difficulty. These included datasets from AMC12 (2022, 2023) and AIME (2022, 2023, 2024), along with subsets of the MATH test set. This approach allowed them to select the most promising models and fine-tune hyperparameters effectively, balancing small representative sets with larger ones to manage submission stochasticity.
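One way to picture this kind of validation is to score a model on each internal set across several random seeds and look at both the mean and the spread. The sketch below is a hypothetical harness: the `solve` callable, the set names, and the toy data are all made up for illustration and do not reflect the team's evaluation code.

```python
import statistics
from typing import Callable, Dict, List, Tuple

Problem = Tuple[str, int]  # (question text, integer answer)


def accuracy(solve: Callable[[str, int], int], problems: List[Problem], seed: int) -> float:
    """Fraction of problems answered correctly for one sampling seed."""
    correct = sum(solve(question, seed) == truth for question, truth in problems)
    return correct / len(problems)


def evaluate(solve: Callable[[str, int], int],
             validation_sets: Dict[str, List[Problem]],
             seeds: List[int]) -> Dict[str, Tuple[float, float]]:
    """Return {set name: (mean accuracy, standard deviation across seeds)}."""
    report = {}
    for name, problems in validation_sets.items():
        scores = [accuracy(solve, problems, seed) for seed in seeds]
        report[name] = (statistics.mean(scores), statistics.pstdev(scores))
    return report


def toy_solver(question: str, seed: int) -> int:
    """Stand-in for a sampled model: correct except for one question on seed 0."""
    if seed == 0 and len(question) == 1:
        return 0
    return len(question)


validation_sets = {
    "small": [("a", 1), ("bb", 2)],
    "large": [("ccc", 3), ("dddd", 4), ("eeeee", 5), ("ffffff", 6)],
}
report = evaluate(toy_solver, validation_sets, seeds=[0, 1])
print(report)
```

A small set gives fast feedback but swings more from seed to seed, which is why pairing it with a larger set helps separate real improvements from sampling noise.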

What Didn’t Work and Promising Ideas

Not everything in NuminaMath was a smashing success. The team tried different ideas such as:

1. CoT Model with Majority Voting: They trained a pure Chain of Thought (CoT) model and evaluated it using majority voting. This method did not yield the desired results.

2. MMOS Model for Single-Step Solutions: They also attempted to train a model based on the Mix of Minimal Optimal Sets (MMOS) to solve problems using a single Python step. This approach was not successful either.

A Promising Approach: Kahneman-Tversky Optimisation (KTO)

Another technique involved applying KTO to new completions sampled from the SFT model. This approach was inspired by OrcaMath and involved the following steps:

- Sampling four completions per problem from the SFT model, using prompts that combined rationales and code execution from the Stage 2 dataset.

- Comparing the extracted answers to the ground truth and labeling the samples as positive if correct and negative if incorrect.
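The two steps above amount to building a dataset of completions tagged as desirable or undesirable. Here is a minimal sketch of that labeling loop. The `sample` and `extract_answer` callables are hypothetical stand-ins, and the `prompt` / `completion` / `label` field names mirror the unpaired layout commonly used for KTO training, so treat them as an assumption.

```python
from typing import Callable, Dict, List, Optional


def build_kto_records(prompt: str, ground_truth: str,
                      sample: Callable[[str], str],
                      extract_answer: Callable[[str], Optional[str]],
                      k: int = 4) -> List[Dict[str, object]]:
    """Sample k completions and label each one by comparing its answer to the truth."""
    records = []
    for _ in range(k):
        completion = sample(prompt)
        label = extract_answer(completion) == ground_truth  # True = positive example
        records.append({"prompt": prompt, "completion": completion, "label": label})
    return records


def extract_final_answer(text: str) -> Optional[str]:
    """Pull the text after a 'Final answer: ' marker, if there is one."""
    marker = "Final answer: "
    return text.rsplit(marker, 1)[1].strip() if marker in text else None


# Canned completions stand in for sampling from the SFT model.
fake_completions = iter([
    "6 * 7 = 42. Final answer: 42",
    "6 * 7 = 41. Final answer: 41",
    "Using Python gives 42. Final answer: 42",
    "I am not sure.",
])
records = build_kto_records(
    prompt="What is 6 times 7?",
    ground_truth="42",
    sample=lambda _prompt: next(fake_completions),
    extract_answer=extract_final_answer,
)
print([r["label"] for r in records])
```

Unlike pairwise preference methods, this setup does not need a matched good and bad completion for each prompt, only a binary label per sample.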

Although this form of on-policy KTO produced a slightly better model than the SFT one, it only resulted in a modest improvement (a few percentage points) on internal evaluations and scored 27/50 on the public leaderboard. One advantage of using KTO was the ability to track the implicit reward during training, which greatly assisted in debugging. For instance, successful training logs showed an increase in rewards for correct solutions while suppressing the rewards for incorrect ones.
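The implicit reward mentioned here is usually defined as the scaled log-probability ratio between the policy being trained and the frozen reference model. The sketch below shows how such a metric could be averaged separately for positive and negative samples. The `beta` value and the log-probabilities are made-up numbers, and the helper names are hypothetical.

```python
from typing import List, Tuple


def implicit_reward(policy_logp: float, reference_logp: float, beta: float = 0.1) -> float:
    """beta * (log pi_theta(y|x) - log pi_ref(y|x)) for one completion."""
    return beta * (policy_logp - reference_logp)


def mean_rewards(batch: List[Tuple[float, float, bool]],
                 beta: float = 0.1) -> Tuple[float, float]:
    """Average implicit reward for positive and for negative samples in a batch."""
    positive = [implicit_reward(p, r, beta) for p, r, label in batch if label]
    negative = [implicit_reward(p, r, beta) for p, r, label in batch if not label]
    return sum(positive) / len(positive), sum(negative) / len(negative)


# (policy log-prob, reference log-prob, is_correct) with made-up values
batch = [(-10.0, -12.0, True), (-15.0, -12.0, False)]
pos_reward, neg_reward = mean_rewards(batch)
print(pos_reward, neg_reward)
```

In a healthy run the positive average drifts upward and the negative average downward over training steps, so a log where both move together is an early hint that something is off.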

Unfortunately, the team didn’t have enough time to include KTO in NuminaMath, but the idea seems quite promising.

The Results

NuminaMath climbed to the top of the AIMO leaderboard by correctly answering 29 of the 50 problems. Notably, the model answered 7 more problems than the second-place entry.

NuminaMath represents an important iteration in frontier models for math reasoning. The AIMO prize might be one of the highest levels of testing we can find in terms of math reasoning, and NuminaMath performed at very impressive levels. Hopefully, some of the ideas behind NuminaMath will inspire other models in the math and reasoning space.
