Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Zejun Ma, Bo An

*: Equal Contribution

GitHub: https://github.com/ltzheng/SimpleTIR

— July 2, 2025

Figure 1: Red Line: The training dynamics of SimpleTIR starting from the Qwen2.5-7B base model, without cold start, format reward, or reward models. It clearly outperforms text-only reasoning methods that do not use TIR. Purple Line: The unstable training dynamics of naive multi-turn training, which shows erratic performance on AIME24 and suffers from catastrophic gradient norm explosion.

The Challenge of End-to-End Multi-Turn Training

Training Large Language Models (LLMs) to perform multi-turn Tool-Integrated Reasoning (TIR), where the model iteratively generates code, executes it, and reasons over the execution results, is one of the most promising frontiers in Reinforcement Learning (RL). This capability enables models to tackle complex mathematical problems, conduct sophisticated data analysis, and carry out multi-step reasoning that mirrors human problem solving.

Figure 2: A demonstration of multi-turn TIR. The model recognizes errors in its code blocks and tries a different approach to verify the answer.

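To make the loop concrete, here is a minimal sketch of a single multi-turn TIR rollout. The `llm.generate` and `executor.run` interfaces, the `<output>` feedback tags, and the use of fenced Python code blocks to mark tool calls are illustrative assumptions, not SimpleTIR's exact implementation.

```python
import re

# Assumed interfaces: `llm.generate(prompt)` returns a text completion and
# `executor.run(code)` executes Python in a sandbox, returning its output.
CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def tir_rollout(llm, executor, question, max_turns=8):
    """Collect one multi-turn TIR trajectory: generate, execute, feed back."""
    context = question
    for _ in range(max_turns):
        turn = llm.generate(context)
        context += turn

        match = CODE_BLOCK.search(turn)
        if match is None:
            break  # no code block emitted: treat this turn as the final answer

        # Run the generated code and append its output so the next turn can
        # reason over the execution result.
        output = executor.run(match.group(1))
        context += f"\n<output>\n{output}\n</output>\n"
    return context
```

In the Zero RL setting discussed below, the trajectory returned by such a loop would be scored only by a verifiable reward (e.g., answer correctness) and fed directly into the policy update.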

Despite its clear potential, multi-turn TIR training faces a surprisingly difficult challenge: when starting from a base model with only verifiable rewards (the Zero RL setting), training consistently fails, exhibiting severe instability and entropy collapse that render it virtually unusable, as shown in Figure 3.

Figure 3: Naive multi-turn training exhibits frequent gradient norm explosion, entropy collapse, irregular response lengths, and unstable code generation ratios.

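The four signals in Figure 3 are straightforward to track during training. Below is a hedged sketch, assuming a PyTorch policy and a batch of decoded responses with their sampled-token log-probabilities; the function name, the character-length stand-in for token counts, and the negative sampled log-probability as an entropy proxy are illustrative choices, not the paper's actual instrumentation.

```python
import torch

def rollout_diagnostics(model, responses, sampled_logprobs):
    """Per-step health signals: entropy proxy, response length, code ratio, grad norm."""
    # Average negative log-probability of the sampled tokens; a sharp,
    # sustained drop toward zero is a symptom of entropy collapse.
    entropy_proxy = -torch.cat(sampled_logprobs).mean().item()

    # Response length (characters, as a cheap stand-in for tokens) and the
    # fraction of responses that still contain a code block.
    mean_len = sum(len(r) for r in responses) / len(responses)
    code_ratio = sum("```python" in r for r in responses) / len(responses)

    # Global gradient norm, measured after backward(); isolated spikes far
    # above the running average indicate gradient norm explosion.
    grad_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=float("inf")
    ).item()

    return {
        "entropy_proxy": entropy_proxy,
        "mean_response_len": mean_len,
        "code_ratio": code_ratio,
        "grad_norm": grad_norm,
    }
```

Plotted over training steps, these quantities correspond to the four panels in Figure 3 and make the onset of instability easy to spot.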

This isn't an isolated problem. The teams behind RAGEN, ZeroTIR, Kevin, and Kimi-Researcher, as well as a reported issue in VeRL, have all described similar phenomena when applying reinforcement learning to multi-turn scenarios. What should be a natural extension of single-turn TIR training instead becomes an ongoing struggle with catastrophic instability.

> Across several runs, we observe that around steps 35–40, the model begins generating repetitive or nonsensical responses. We hypothesize that this is because the model has deviated into a region of instability.
>
> (from the Kevin report)