NVIDIA Unveils Major CUDA 13.1 Update and Secures AGI Competition Win

Victor Zhang
[Image: NVIDIA CUDA 13.1 logo with a GPU chip and abstract AI elements, symbolizing advanced computing and AGI.]

As AI systems move beyond text and into more complex reasoning, NVIDIA has made significant advancements on two fronts: a foundational software update and a notable achievement in an artificial general intelligence (AGI) competition. The company has introduced CUDA 13.1, described as the most comprehensive update to the CUDA platform in two decades, alongside a victory in the Kaggle ARC Prize 2025 competition.

Highlights

  • CUDA 13.1 Launch: This update introduces the CUDA Tile programming model, designed to simplify development for next-generation GPUs like Blackwell by abstracting low-level hardware details.

  • Kaggle ARC Prize 2025 Victory: A team from NVIDIA, KGMoN, secured first place in the AGI competition using a 4-billion-parameter model variant, demonstrating high performance at a low inference cost.

Context

The CUDA 13.1 update aims to streamline the development process for high-performance computing on NVIDIA GPUs. Meanwhile, the Kaggle ARC Prize 2025 is widely regarded as a benchmark for measuring progress toward AGI, focusing on an AI's ability to generalize to unfamiliar problems rather than relying on rote memorization or pattern matching.

Engineering Notes

CUDA 13.1 introduces several key features:

  • CUDA Tile Programming Model: This model lets developers program GPUs at a higher level of abstraction, specifying operations on blocks of data (Tiles) rather than on individual threads. The approach is intended to keep code compatible with future GPU architectures and to simplify the use of specialized hardware such as Tensor Cores. The update includes CUDA Tile IR, a new virtual instruction set architecture, and cuTile Python, a domain-specific language for array- and Tile-based kernels. (A conceptual sketch of the tile-of-data pattern follows this list.)

  • Green Context Enhancements: The Green Context, a lightweight alternative to the traditional CUDA Context, is now accessible via the runtime API. It enables finer-grained spatial partitioning and resource provisioning on the GPU, allowing developers to dedicate specific Streaming Multiprocessors (SMs) to particular contexts. (A driver-API sketch of SM partitioning follows this list.)

  • CUDA Multi-Process Service (MPS) Updates: New features include Memory Locality Optimized Partitioning (MLOPart) for Blackwell GPUs, which creates dedicated CUDA devices to improve memory locality, and Static Streaming Multiprocessor Partitioning for Ampere and newer architectures, providing exclusive SM partitions for MPS clients to enhance resource isolation.

  • Developer Tools: NVIDIA Nsight Compute 2025.4 now supports analyzing CUDA Tile kernels, with detailed statistics and source-code mapping. NVIDIA Compute Sanitizer 2025.4 adds compile-time patching for memory-error detection, improving both performance and accuracy. NVIDIA Nsight Systems 2025.6.1 adds system-level CUDA tracing, host-function tracing, and new hardware-tracing capabilities.

  • Math Libraries: cuBLAS gains an experimental API with Grouped GEMM for FP8 and BF16/FP16 on Blackwell GPUs, offering up to a 4x speedup in MoE use cases. cuSPARSE introduces a new SpMVOp API with improved performance, and cuFFT adds device APIs for querying or generating device function code. cuBLAS and cuSOLVER also see significant performance improvements on Blackwell GPUs. (A baseline GEMM sketch follows this list.)

  • NVIDIA CUDA Core Compute Libraries (CCCL): CCCL 3.1 adds floating-point determinism options for cub::DeviceReduce, letting developers balance determinism against performance. A more convenient single-stage CUB API also simplifies temporary-storage management by accepting a memory resource; the classic two-phase pattern it streamlines is sketched below.
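
To ground the Tile model: the sketch below is not cuTile syntax (neither cuTile Python nor Tile IR is reproduced here), but plain CUDA C++ illustrating the pattern that Tile programming abstracts away, with a thread block cooperatively staging a fixed-size tile in shared memory and all per-thread index bookkeeping written out by hand.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 32;  // tile edge length; one thread block owns one tile

// Conceptual sketch only: the block cooperatively loads a TILE x TILE
// sub-matrix into shared memory, operates on it as a unit, and writes it
// back. A tile-level model such as cuTile expresses each of these steps as
// one operation on the whole tile, leaving thread mapping and memory
// layout to the compiler.
__global__ void scale_tiles(const float* in, float* out, int n, float alpha)
{
    __shared__ float tile[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // per-thread coordinates that
    int col = blockIdx.x * TILE + threadIdx.x;  // tile-level models hide

    if (row < n && col < n)
        tile[threadIdx.y][threadIdx.x] = in[row * n + col];  // cooperative load
    __syncthreads();

    // ... tile-wide work would go here (transform, reduction, matmul step) ...

    if (row < n && col < n)
        out[row * n + col] = alpha * tile[threadIdx.y][threadIdx.x];
}
```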
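
The Green Context partitioning workflow can be sketched with the pre-existing driver-API entry points (available since CUDA 12.4); the new runtime-API equivalents in 13.1 are not reproduced here, and error checking is omitted for brevity.

```cuda
#include <cuda.h>

// Driver-API sketch: carve at least 16 SMs out of a device and obtain a
// stream bound to that partition. Work launched on the stream executes
// only on the dedicated SMs.
void make_green_context(CUdevice dev)
{
    CUdevResource sm_pool;
    cuDeviceGetDevResource(dev, &sm_pool, CU_DEV_RESOURCE_TYPE_SM);

    CUdevResource partition, remainder;
    unsigned int num_groups = 1;
    // Split off one group with a minimum of 16 SMs; the rest stays in 'remainder'.
    cuDevSmResourceSplitByCount(&partition, &num_groups, &sm_pool,
                                &remainder, /*useFlags=*/0, /*minCount=*/16);

    CUdevResourceDesc desc;
    cuDevResourceGenerateDesc(&desc, &partition, 1);

    CUgreenCtx green;
    cuGreenCtxCreate(&green, desc, dev, CU_GREEN_CTX_DEFAULT_STREAM);

    CUstream stream;
    cuGreenCtxStreamCreate(&stream, green, CU_STREAM_NON_BLOCKING, /*priority=*/0);
}
```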
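
The experimental Grouped GEMM signature is not reproduced here; for orientation, the sketch below shows the long-standing single-problem mixed-precision call, cublasGemmEx, that a grouped API batches across many independent problems (for example, one GEMM per MoE expert).

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Baseline sketch: one FP16 GEMM with FP32 accumulation via the classic
// cublasGemmEx entry point. A grouped GEMM API dispatches many such
// independent problems in a single call instead of looping over them.
void gemm_fp16(cublasHandle_t handle, int m, int n, int k,
               const __half* A, const __half* B, float* C)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,   // lda = m (column-major)
                 B, CUDA_R_16F, k,   // ldb = k
                 &beta,
                 C, CUDA_R_32F, m,   // ldc = m
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```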
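
For context on the CCCL change, the sketch below shows the classic two-phase cub::DeviceReduce pattern, in which the first call only sizes the temporary storage. CCCL 3.1's single-stage API (signature not shown here) folds this into one call by accepting a memory resource that supplies the scratch space.

```cuda
#include <cub/device/device_reduce.cuh>
#include <cuda_runtime.h>

// Classic two-phase CUB pattern: call once with a null buffer to query the
// temporary-storage size, allocate it, then call again to run the reduction.
float sum_on_device(const float* d_in, int num_items)
{
    float* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));

    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);  // size query
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);  // actual reduce

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_out);
    return result;
}
```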

What Comes Next

The NVIDIA team's victory in the Kaggle ARC Prize 2025 highlights an alternative approach to AGI development. Their solution, NVARC, paired a 4B-parameter model (Qwen3) with an extensive synthetic data generation pipeline. Rather than deploying a large model at inference time, the team generated 3.2 million synthetic training examples by having gpt-oss-120b, a 120B-parameter open-weight model, write Python code that defines puzzle inputs and outputs. This "brute-force aesthetics" approach to data generation, coupled with a smaller specialized model, demonstrated that high-quality synthetic reasoning data can be a more important scaling factor than parameter count alone.

On the inference side, the NVARC solution employed Test-Time Training (TTT): for each new puzzle, the model is briefly fine-tuned with LoRA adapters on the few examples provided. Depth-first search (DFS) was then used during inference, letting the model generate and verify candidate Python code solutions. Together, these techniques allowed the model to adapt to each puzzle's style and explore multiple code paths toward a working solution.

The NVARC team's methodology suggests that future advancements in AGI may increasingly depend on sophisticated problem-creation methods and targeted data generation rather than solely on increasing model size.