Nanbeige4-3B Model Challenges Larger LLMs with Enhanced Performance and Efficiency

Victor Zhang

Boss Zhipin's Nanbeige LLM Lab has released Nanbeige4-3B, a small language model (SLM) with 3 billion parameters. This model aims to push the capabilities of smaller models, offering faster inference speeds and lower deployment costs compared to larger language models (LLMs).

Model Development and Features

Nanbeige4-3B underwent pre-training with 23 trillion tokens. Its development included a hybrid data filtering system for quality discrimination and a fine-grained Warmup-Stable-Decay (WSD) scheduler to optimize the use of high-quality data. Post-training involved fine-tuning with over 30 million high-quality instructions and multi-stage reinforcement learning (RL). The model also integrates Chain-of-Thought refinement, large-scale tool invocation environment synthesis, and multi-granularity distillation algorithms.

In comparisons with the Qwen3 series, Nanbeige4-3B reportedly surpassed Qwen3-4B and Qwen3-8B, and it remained competitive with larger Qwen models on several metrics. On complex reasoning tasks such as AIME and GPQA, it outperformed Qwen3-32B and Qwen3-30B-A3B.

The model also showed strong performance in practical evaluations. On the BFCL-V4 tool invocation benchmark, its score was more than 10% higher than those of Qwen3-32B and Qwen3-30B-A3B. Its performance on the Arena-Hard-V2 human preference alignment leaderboard was comparable to Qwen3-30B-A3B. In the November 2025 WritingBench leaderboard, Nanbeige4-3B ranked 11th among 54 models, with creative capabilities described as comparable to far larger models such as DeepSeek-R1-0528.

Pre-training Methodology

Nanbeige4-3B's pre-training focused on data quality and scheduling. A hybrid quality filtering system was developed to evaluate data on two axes: intrinsic attributes derived from quality tags and external alignment measured by retrieval recall. The system used 20 quality tags together with a retrieval recall mechanism run against a billion-scale mixed retrieval database. This process selected 12.5 trillion tokens of high-quality data; an additional 6.5 trillion tokens of the highest-scoring data were then upsampled, contributing to the 23-trillion-token training corpus.
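
The report does not publish the filter's implementation, but the idea of blending a tag-based intrinsic score with a retrieval-recall alignment signal can be illustrated with a small sketch. Everything below, including the tag names, the blending weight alpha, and the thresholds, is an assumption for illustration only.

```python
# Hypothetical sketch of a hybrid quality filter: each document gets an
# intrinsic score from quality tags and an external score from retrieval
# recall, and only documents passing a combined threshold are kept.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    quality_tags: dict[str, float]   # e.g. {"fluency": 0.9, "factuality": 0.7}
    retrieval_recall: float          # fraction of probe queries that recall this doc

def intrinsic_score(doc: Document) -> float:
    # Average the tag scores; the real system reportedly uses 20 tags.
    if not doc.quality_tags:
        return 0.0
    return sum(doc.quality_tags.values()) / len(doc.quality_tags)

def hybrid_score(doc: Document, alpha: float = 0.6) -> float:
    # Blend intrinsic (tag-based) and external (retrieval-recall) signals.
    return alpha * intrinsic_score(doc) + (1 - alpha) * doc.retrieval_recall

def filter_corpus(docs: list[Document], keep_threshold: float = 0.5,
                  upsample_threshold: float = 0.8) -> list[Document]:
    kept = []
    for doc in docs:
        score = hybrid_score(doc)
        if score >= keep_threshold:
            kept.append(doc)
            if score >= upsample_threshold:
                kept.append(doc)   # crude 2x upsampling of top-scoring data
    return kept
```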

The Fine-Grained WSD scheduler dynamically adjusted data ratios during the constant learning rate phase. Early training emphasized corpus diversity, while later stages focused on higher-quality data. This strategy was implemented across four stages: Warmup (0.1T tokens), Diversity Stable Stage (12.4T), High-Quality Stable Stage (6.5T), and Decay Stage (4T). During the decay stage, the context length was extended to 64K using the Adjusting Base Frequency (ABF) method.
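
As a rough illustration of how the four stages map onto a learning-rate schedule, the sketch below uses the token budgets quoted above. It covers only the learning-rate shape, not the data-ratio adjustment; the peak and final learning rates and the cosine decay form are assumptions, since the report's exact hyperparameters are not given here.

```python
# Illustrative WSD learning-rate schedule with the stated token budgets:
# 0.1T warmup, 12.4T + 6.5T constant-LR stable stages, 4T decay.
import math

WARMUP_T  = 0.1e12
STABLE1_T = 12.4e12   # diversity-focused stable stage
STABLE2_T = 6.5e12    # high-quality stable stage
DECAY_T   = 4.0e12
TOTAL_T   = WARMUP_T + STABLE1_T + STABLE2_T + DECAY_T   # 23T tokens

def wsd_lr(tokens_seen: float, peak_lr: float = 3e-4, final_lr: float = 3e-5) -> float:
    if tokens_seen < WARMUP_T:                            # linear warmup
        return peak_lr * tokens_seen / WARMUP_T
    if tokens_seen < WARMUP_T + STABLE1_T + STABLE2_T:    # constant plateau
        return peak_lr
    # Cosine decay over the final 4T tokens.
    progress = (tokens_seen - (TOTAL_T - DECAY_T)) / DECAY_T
    progress = min(max(progress, 0.0), 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```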

For performance evaluation, a Post-SFT paradigm was used, applying the same fine-tuning process to Nanbeige4-3B and other open-source Base models before comparing their performance on downstream tasks.

Post-training Process

The post-training for Nanbeige4-3B involved a four-stage process:

Cold-Start SFT

This stage utilized 30 million high-quality mathematical, scientific, and code samples, with mathematical reasoning accounting for 50%, scientific reasoning for 30%, and code tasks for 20%. The model's performance on tasks like AIME 2025 and GPQA-Diamond improved as the SFT data increased.
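
The 50/30/20 domain mix can be expressed as a simple weighted sampler. The sketch below is a hypothetical illustration; only the ratios come from the description above.

```python
# Weighted sampling of cold-start SFT examples by domain.
import random

# Domain mixing ratios stated in the text; pools and sampler are illustrative.
DOMAIN_MIX = {"math": 0.5, "science": 0.3, "code": 0.2}

def sample_sft_batch(pools: dict[str, list[dict]], batch_size: int,
                     seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    domains = list(DOMAIN_MIX)
    weights = [DOMAIN_MIX[d] for d in domains]
    batch = []
    for _ in range(batch_size):
        domain = rng.choices(domains, weights=weights, k=1)[0]
        batch.append(rng.choice(pools[domain]))
    return batch
```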

Overall SFT

This stage aimed to enhance the model's comprehensive capabilities, including human preference alignment and tool invocation. For human preference alignment, a deliberative generation and Chain-of-Thought reconstruction paradigm was used. This involved multi-dimensional evaluation checklists, multiple teacher models for candidate answers, and a dedicated evaluation model for cross-scoring. A Chain-Completion Model was also trained to reconstruct logical Chain-of-Thought paths. For tool invocation, a multi-agent data synthesis strategy was employed, using LLMs to simulate user-assistant-environment interactions.
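
One piece of this pipeline, cross-scoring candidate answers against a multi-dimensional checklist, can be sketched as follows. The checklist items and the stand-in scorer are hypothetical; the report only describes the overall structure (multiple teacher candidates, a dedicated evaluation model, checklist-based scoring).

```python
# Minimal sketch of checklist-based cross-scoring: several teacher models
# propose candidate answers, an evaluation model scores each candidate
# against a checklist, and the best candidate is kept for SFT.
from typing import Callable

CHECKLIST = ["follows instructions", "factually consistent", "well structured"]

def select_best_candidate(candidates: list[str],
                          score_fn: Callable[[str, str], float]) -> str:
    # score_fn(answer, criterion) -> score in [0, 1], e.g. an LLM-judge call.
    def total(answer: str) -> float:
        return sum(score_fn(answer, criterion) for criterion in CHECKLIST)
    return max(candidates, key=total)

if __name__ == "__main__":
    # Toy usage with a trivial stand-in scorer (longer answers score higher).
    answers = ["Short reply.", "A longer, more detailed and structured reply."]
    print(select_best_candidate(answers, lambda a, c: min(len(a) / 50, 1.0)))
```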

Distillation

Nanbeige4-3B was distilled from the Nanbeige3.5-Pro flagship model. A Dual-Level Preference Distillation (DPD) algorithm was developed to optimize both token-level and sequence-level distribution alignment. The method aimed to transfer the larger model's capabilities to the smaller one, yielding improvements across evaluation dimensions, including AIME (+8%), GPQA (+10%), and BFCL-V4 (+30%).
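
The DPD objective itself is not spelled out in this summary, but a dual-level loss in that spirit, i.e., a token-level KL term plus a DPO-style sequence-level preference term, might look like the PyTorch sketch below. The weighting scheme and the specific sequence-level formulation are assumptions, not the lab's published loss.

```python
# Sketch of a dual-level distillation loss: token-level KL between teacher and
# student next-token distributions, plus a sequence-level preference term that
# favors teacher-preferred sequences over rejected ones.
import torch
import torch.nn.functional as F

def dual_level_loss(student_logits, teacher_logits,
                    chosen_logp_student, rejected_logp_student,
                    chosen_logp_teacher, rejected_logp_teacher,
                    alpha: float = 0.5, beta: float = 0.1) -> torch.Tensor:
    # Token level: KL(teacher || student) averaged over the batch.
    token_kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )
    # Sequence level: DPO-style margin between chosen and rejected sequences,
    # referenced against the teacher's own margin.
    margin = beta * ((chosen_logp_student - rejected_logp_student)
                     - (chosen_logp_teacher - rejected_logp_teacher))
    seq_loss = -F.logsigmoid(margin).mean()
    return alpha * token_kl + (1 - alpha) * seq_loss
```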

Reinforcement Learning (RL)

A phased, domain-specific RL strategy was adopted. The first stage optimized performance on mathematical and scientific problems using a tool-augmented verifier. The second stage focused on code programming capabilities through data synthesis and code sandbox verification. The third stage improved human preference alignment tasks, such as writing and open-ended question answering, by optimizing the reward model.
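
A minimal way to picture this domain-specific reward routing is a dispatcher that sends each rollout to the appropriate checker. The three callables below stand in for the tool-augmented verifier, code sandbox, and reward model that the report names but does not detail.

```python
# Hypothetical reward routing for phased, domain-specific RL.
from typing import Callable

def make_reward_fn(math_verifier: Callable[[str, str], bool],
                   code_sandbox: Callable[[str, str], bool],
                   reward_model: Callable[[str, str], float]
                   ) -> Callable[[str, str, str], float]:
    def reward(domain: str, prompt: str, response: str) -> float:
        if domain in ("math", "science"):
            # Binary reward from a verifier that may call external tools.
            return 1.0 if math_verifier(prompt, response) else 0.0
        if domain == "code":
            # Binary reward from executing the response in a sandbox.
            return 1.0 if code_sandbox(prompt, response) else 0.0
        # Writing and open-ended questions fall back to a learned reward model.
        return reward_model(prompt, response)
    return reward
```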

Performance Evaluation

Nanbeige4-3B's performance was evaluated against open-source models of similar and larger scales. On the AIME 2024 and AIME 2025 mathematical reasoning tasks, Nanbeige4-3B scored 90.4 and 85.6, respectively, surpassing Qwen3-32B (81.4 / 72.9) and Qwen3-30B-A3B (89.2 / 85.0). In the scientific domain, it scored 82.2 on GPQA-Diamond, higher than Qwen3-32B (68.7) and Qwen3-30B-A3B (73.4). On the BFCL-V4 tool invocation benchmark, Nanbeige4-3B demonstrated strong capabilities among lightweight open-source models. In the Arena-Hard-V2 evaluation, Nanbeige4-3B achieved a score of 60.0, comparable to Qwen3-30B-A3B.

The Nanbeige4-3B Base model, Thinking model, and technical report are available for download.