Prime Intellect Releases INTELLECT-3, Open-Sourcing 106B MoE Model and RL Stack

Prime Intellect has officially released INTELLECT-3, a 106-billion-parameter Mixture-of-Experts (MoE) model. The company states that INTELLECT-3 achieves strong performance among models of its size across benchmarks in mathematics, code, science, and reasoning. Prime Intellect has open-sourced the complete training pipeline, including model weights, the training framework, datasets, reinforcement learning (RL) environments, and evaluation systems.
INTELLECT-3 was trained on Prime Intellect's in-house RL technology stack, which the company is open-sourcing to support broader research and development in large-scale RL. The same training software and infrastructure will be offered on the Prime Intellect platform, which the company says will widen access to advanced model post-training capabilities.
Training Framework and Infrastructure
INTELLECT-3 is a 106-billion-parameter MoE model, post-trained from GLM-4.5-Air with supervised fine-tuning and RL. Its training involved several core components:
PRIME-RL: Prime Intellect's distributed RL framework, designed for supervised fine-tuning and RL of large-scale MoE models.
Verifiers and Environments Hub: A unified interface and ecosystem for agent-based RL environments and evaluations.
Prime Sandboxes: A high-throughput, secure code execution system for agent code environments.
Compute Orchestration: Scheduling and management across 512 NVIDIA H200 GPUs distributed over 64 interconnected nodes.
The PRIME-RL framework, used for end-to-end training, integrates with the Verifiers interface to cover the entire post-training process, from synthetic data generation through evaluation, and connects to the Environments Hub for access to a wide range of environments and evaluation tasks.
PRIME-RL operates as a fully distributed, asynchronous framework. Prime Intellect's research team indicated that asynchronous, distributed training is necessary for scaling RL, since long-horizon agent rollouts otherwise become a throughput bottleneck. The development of INTELLECT-3 followed six months of ablation experiments focused on performance, stability, and efficiency at scale. Prime Intellect plans to offer PRIME-RL on its upcoming Lab platform, aiming to simplify large-scale RL training for users.
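The asynchronous pattern is easiest to see in miniature. The sketch below shows the generic actor/learner split that such frameworks rely on; every name here is an illustrative stand-in, not PRIME-RL's actual API. Rollout workers keep generating trajectories against a possibly stale policy snapshot while the trainer consumes finished ones, so slow, long-horizon rollouts never stall optimizer steps.

```python
import asyncio
import random

async def rollout_worker(wid: int, queue: asyncio.Queue, policy_version: list):
    """Generate trajectories forever against the latest policy snapshot."""
    while True:
        version = policy_version[0]                    # snapshot current "weights"
        await asyncio.sleep(random.uniform(0.1, 0.5))  # stand-in for a long rollout
        await queue.put({"worker": wid, "policy_version": version,
                         "reward": random.random()})

async def trainer(queue: asyncio.Queue, policy_version: list, steps: int):
    """Consume completed rollouts and take optimizer steps without waiting
    for all workers to finish, which is the point of the async design."""
    for step in range(steps):
        batch = [await queue.get() for _ in range(4)]  # wait only for a batch
        # ... compute the RL loss on `batch` and update weights here ...
        policy_version[0] += 1                         # publish new weights
        print(f"step {step}: trained on rollouts from policy versions "
              f"{sorted({t['policy_version'] for t in batch})}")

async def main():
    queue: asyncio.Queue = asyncio.Queue(maxsize=64)
    policy_version = [0]                               # shared, mutable "weights"
    workers = [asyncio.create_task(rollout_worker(i, queue, policy_version))
               for i in range(8)]
    await trainer(queue, policy_version, steps=5)
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

asyncio.run(main())
```

Note that the trainer may consume rollouts generated by older policy versions; handling that staleness (for example, through importance weighting) is one of the core design problems an asynchronous RL framework has to solve.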
Training Environment and Compute
The training environment for INTELLECT-3 was built with the Verifiers library and hosted on the Environments Hub, Prime Intellect's community hub for RL environments and evaluations. Verifiers is an open-source toolkit for building RL environments and evaluation tasks, offering modular components for complex environment logic without sacrificing performance. The Environments Hub publishes Verifiers-based environments as independent, versioned Python packages, so each task can be pinned and iterated on separately. All environments and evaluations used for INTELLECT-3 have been made public on the Environments Hub.
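To make the versioned-package idea concrete, here is a minimal sketch of a Verifiers-style environment, following the library's public examples; the toy dataset, reward function, and exact constructor arguments are illustrative assumptions, not INTELLECT-3's actual training code.

```python
import verifiers as vf
from datasets import Dataset

# Toy dataset: each row pairs a prompt with a verifiable answer
# (illustrative; INTELLECT-3's real datasets live on the Environments Hub).
dataset = Dataset.from_list([
    {"question": "What is 17 * 3?", "answer": "51"},
    {"question": "What is 2 ** 10?", "answer": "1024"},
])

def exact_match(completion, answer, **kwargs) -> float:
    """Reward 1.0 when the reference answer appears in the completion."""
    text = completion if isinstance(completion, str) else completion[-1]["content"]
    return 1.0 if answer in text else 0.0

# A rubric combines one or more reward functions with weights.
rubric = vf.Rubric(funcs=[exact_match], weights=[1.0])

# A single-turn environment: one prompt, one completion, scored by the rubric.
env = vf.SingleTurnEnv(dataset=dataset, rubric=rubric)
```

Packaged and pushed to the Hub, such an environment can then be installed and loaded by name, which is what lets training and evaluation pin tasks to exact versions.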
Prime Intellect expanded its Sandboxes infrastructure to support RL. The system securely executes untrusted code across thousands of concurrent rollouts, which requires low-latency container orchestration. Prime Sandboxes bypasses the Kubernetes control plane and communicates directly with pods through a Rust service, achieving near-native process latency. Sandboxes start within 10 seconds even under high concurrency, and each node can run hundreds of isolated sandboxes. Researchers also overlap sandbox startup with the model's first inference pass, eliminating perceived waiting time before code execution.
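That overlap amounts to launching sandbox provisioning and the first model call concurrently, so total latency is roughly the maximum of the two rather than their sum. A minimal illustration, with hypothetical stand-in functions rather than the Prime Sandboxes API:

```python
import asyncio

async def create_sandbox() -> str:
    await asyncio.sleep(2.0)          # stand-in for container provisioning
    return "sandbox-42"

async def first_inference(prompt: str) -> str:
    await asyncio.sleep(2.5)          # stand-in for the model's first turn
    return "print('hello')"

async def run_rollout(prompt: str) -> None:
    # Kick off provisioning immediately, then run inference concurrently.
    sandbox_task = asyncio.create_task(create_sandbox())
    code = await first_inference(prompt)
    sandbox = await sandbox_task      # usually already finished by now
    print(f"executing {code!r} in {sandbox}")

asyncio.run(run_rollout("Write hello world."))
```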
The training run used 512 NVIDIA H200 GPUs across 64 interconnected nodes. Engineering challenges included maintaining determinism and synchronization in a distributed system prone to hardware failures. Resource preparation relied on Ansible for infrastructure-as-code, automated hardware discovery, and InfiniBand pre-checks to weed out faulty nodes. Scheduling used Slurm with cgroup v2 to guarantee clean task exits. Storage combined Lustre for high-throughput training I/O with NVMe NFS for metadata and SSH storage. Observability ran on DCGM and Prometheus, which were used to monitor nodes and decommission unstable ones.
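Prime Intellect has not published its health-check scripts, but the monitor-and-decommission loop it describes composes from standard tools. A hedged sketch using a DCGM diagnostic and Slurm draining (the node name and diagnostic level are assumptions for illustration):

```python
import subprocess

def node_is_healthy() -> bool:
    # Level-1 DCGM diagnostic: fast sanity checks on the local GPUs.
    result = subprocess.run(["dcgmi", "diag", "-r", "1"],
                            capture_output=True, text=True)
    return result.returncode == 0 and "Fail" not in result.stdout

def drain_node(node: str, reason: str) -> None:
    # Mark the node DRAIN so Slurm stops scheduling new work onto it.
    subprocess.run(["scontrol", "update", f"NodeName={node}",
                    "State=DRAIN", f"Reason={reason}"], check=True)

if __name__ == "__main__":
    if not node_is_healthy():
        drain_node("gpu-node-17", "failed DCGM diag")  # hypothetical node name
```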
Training Plan and Future Focus
The training of INTELLECT-3 involved two primary stages: supervised fine-tuning from GLM-4.5-Air, followed by large-scale RL training. Both stages, along with multiple rounds of ablation experiments, ran on the 512 H200 GPUs over two months. Researchers trained the model across diverse RL environments covering mathematics, code, science, logic, deep research, and software engineering to strengthen its reasoning and agent capabilities. All of these environments are publicly available on the Environments Hub, along with standardized implementations of the benchmark evaluations.
Prime Intellect's future work will focus on:
Expanding Agent-based RL: Continued training with an increased emphasis on agent environments to improve performance across more tasks.
Richer RL Environments: Leveraging the Environments Hub's collection of more than 500 tasks, spanning research, computer use, theorem proving, automation, and specialized domains. INTELLECT-3's training used only a fraction of these, and future work aims to incorporate more community-built tasks.
Long-horizon Agents: Developing models that manage their own context, including context pruning, branched inference, and external memory, so that long-horizon behavior becomes trainable via RL. Future research will also explore environments that explicitly reward long-horizon reasoning.
Prime Intellect states that its goal is to build an open superintelligence technology stack that makes advanced model training broadly accessible. The company suggests that INTELLECT-3 demonstrates smaller labs can train models competitive with those from larger organizations.