DeepSeek Introduces mHC Architecture, Enhancing AI Model Performance with Minimal Overhead

Alex Chen
Intricate, glowing neural network structure symbolizing advanced AI architecture and data processing.

DeepSeek has unveiled a new AI architecture, "manifold-constrained hyper-connections" (mHC), detailed in a paper co-authored by Liang Wenfeng and published on the first day of 2026. The architecture reportedly achieves significant performance improvements on a 27-billion-parameter model at the cost of only an approximately 6.7% increase in training time.

mHC: Manifold-Constrained Hyper-Connections

The mHC architecture projects matrices onto a constrained manifold to optimize the residual connection space, aiming to ensure stability. This approach allows for the expansion of residual stream width with negligible computational and memory costs. The paper, titled "mHC: Manifold-Constrained Hyper-Connections," is available on arXiv.

A research paper titled 'mHC: Manifold-Constrained Hyper-Connections' open on a desk.

Jie Zhenda, Wei Yixuan, and Huanqi Cao are credited as core contributors to the research, with Jie Zhenda serving as the corresponding author.

Technical Foundations and Design

The core objective of mHC is to restore the identity-mapping property within the topological design of Hyper-Connections (HC), making the approach practical for large-scale training and foundation-model tasks. Traditional residual connections offer stability but limited expressive capacity, while HC enhances connectivity at the expense of stability and efficiency; mHC constrains the parameter space of HC to a specific manifold to re-establish the identity-mapping structure.

Inspired by the identity-mapping principle, mHC constrains the residual mapping to a specific manifold. While a pure identity mapping ensures training stability by leaving the skip path untouched, it also limits information interaction within the residual stream. The authors therefore propose projecting the residual mapping onto a manifold that maintains stable signal propagation across layers while fostering interaction between residual streams.
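
In simplified notation (an illustrative sketch rather than the paper's exact formulation), the progression can be written as follows, where X_l stacks the n residual streams, F is the layer function, and G_l stands in for however the layer output is written back into the streams:

```latex
% Standard residual block: the skip path is the identity mapping.
x_{l+1} = x_l + F(x_l)

% Hyper-connections widen the stream to n copies, X_l \in \mathbb{R}^{n \times d},
% and replace the identity skip with a learned residual mapping H_l
% (G_l denotes the write-back of the layer output; details omitted):
X_{l+1} = H_l X_l + G_l(F, X_l)

% mHC projects H_l onto the manifold of doubly stochastic matrices:
H_l \in \Big\{ H \in \mathbb{R}^{n \times n} \;:\; H_{ij} \ge 0,\;
\textstyle\sum_j H_{ij} = 1,\; \sum_i H_{ij} = 1 \Big\}
```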

A semi-abstract visualization of a doubly stochastic matrix, representing balanced connections.

To achieve this, the residual mapping is constrained to be a doubly stochastic matrix, meaning its elements are non-negative and the elements in each row and each column sum to 1. This constraint provides several properties that matter for large-scale training: norm preservation (a spectral norm upper bound of 1, mitigating gradient explosion), compositional closure (stability is maintained when mappings are stacked across layers), and a geometric interpretation via the Birkhoff polytope (the residual mapping is a convex combination of permutation matrices).
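
These properties are easy to verify numerically. The following is a minimal sketch (not the paper's code) that builds an approximately doubly stochastic matrix and checks the norm and closure claims:

```python
# Minimal numerical check (not the paper's code) of the properties described
# above: a doubly stochastic matrix has spectral norm at most 1, and products
# of doubly stochastic matrices remain doubly stochastic.
import numpy as np

rng = np.random.default_rng(0)

def random_doubly_stochastic(n: int, iters: int = 50) -> np.ndarray:
    """Sample a positive matrix and Sinkhorn-normalize it (illustrative)."""
    m = np.exp(rng.normal(size=(n, n)))
    for _ in range(iters):
        m /= m.sum(axis=1, keepdims=True)  # rows sum to 1
        m /= m.sum(axis=0, keepdims=True)  # columns sum to 1
    return m

H1, H2 = random_doubly_stochastic(4), random_doubly_stochastic(4)

# Norm preservation: the spectral norm (largest singular value) is at most 1,
# so repeated application cannot blow up signal norms.
print(np.linalg.norm(H1, 2))             # ~1.0

# Compositional closure: the product still has unit row and column sums,
# so the guarantee survives stacking many layers.
P = H1 @ H2
print(P.sum(axis=0), P.sum(axis=1))      # all approximately 1
```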

Implementation and Optimization

The calculation process for mHC involves flattening the input hidden matrix into a vector to preserve contextual information, followed by obtaining dynamic and static mappings. The final constrained mapping is derived using the Sigmoid function and the Sinkhorn–Knopp operator, which ensures elements are positive through exponentiation and then performs alternating iterative normalization to achieve row and column sums of 1. In experiments, 20 iterations were used as a practical approximation.
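
A minimal sketch of that computation follows, with hypothetical shapes and parameter names (W_dyn for the dynamic mapping, b_static for the static one); the real implementation is a fused, mixed-precision kernel rather than this:

```python
# Sketch of producing the constrained residual mapping: flatten the hidden
# matrix, form dynamic (sigmoid) and static terms, then apply the
# Sinkhorn-Knopp operator (exponentiation for positivity, followed by
# alternating row/column normalization). Shapes and names (W_dyn, b_static)
# are hypothetical, not the paper's.
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """exp() makes every entry positive; alternating normalization drives
    row and column sums toward 1 (20 iterations, as in the experiments)."""
    m = torch.exp(logits)
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)   # rows
        m = m / m.sum(dim=-2, keepdim=True)   # columns
    return m

def mhc_mapping(X: torch.Tensor, W_dyn: torch.Tensor, b_static: torch.Tensor) -> torch.Tensor:
    """X: (n, d) widened residual stream; returns an (n, n) constrained mapping."""
    n = X.shape[0]
    flat = X.reshape(-1)                              # flatten to keep full context
    dynamic = torch.sigmoid(flat @ W_dyn).view(n, n)  # input-dependent term
    return sinkhorn_knopp(dynamic + b_static)         # add static term, then project

# Toy usage: n = 4 residual streams, hidden width d = 8.
n, d = 4, 8
X = torch.randn(n, d)
H = mhc_mapping(X, torch.randn(n * d, n * n) * 0.02, torch.zeros(n, n))
print(H.sum(dim=0), H.sum(dim=1))  # both approximately all-ones
```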

DeepSeek integrated mHC (with n = 4) into large-scale models, holding the training overhead to the reported increase of approximately 6.7% through engineering optimizations. These included kernel fusion, in which RMSNorm operations on high-dimensional hidden states were reordered to occur after matrix multiplication to improve efficiency. A mixed-precision strategy was also adopted, and multiple operators with shared memory-access patterns were fused into a unified computational kernel to reduce memory-bandwidth bottlenecks.
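
The RMSNorm reordering exploits the fact that RMSNorm is a per-row rescaling, so the normalization can be applied after a projection once the gain is folded into the weight; a rough sketch of the equivalence (not DeepSeek's fused kernel, and with hypothetical sizes):

```python
# Illustrative reordering (not DeepSeek's kernel): normalizing a wide hidden
# state before a small projection is mathematically equivalent to projecting
# first (with the RMSNorm gain folded into the weight) and rescaling the much
# smaller output, which avoids materializing a normalized copy of the wide tensor.
import torch

def rmsnorm(x, gain, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * gain

x = torch.randn(16, 4096)           # wide hidden states (hypothetical sizes)
gain = torch.randn(4096)
W = torch.randn(4096, 16)           # small projection, e.g. down to n*n logits

ref = rmsnorm(x, gain) @ W                               # normalize, then project
inv_rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)
fused = (x @ (gain.unsqueeze(-1) * W)) * inv_rms         # project, then rescale
print(torch.allclose(ref, fused, rtol=1e-3, atol=1e-4))  # True
```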

To address the memory overhead of the n-way residual structure during training, the authors implemented recomputation. Intermediate activations generated by the mHC kernel are discarded after the forward pass and recomputed during the backward pass by re-executing the mHC kernel (excluding the computationally intensive layer function F). This approach minimizes the storage of intermediate activations.
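
In PyTorch terms, this kind of selective recomputation resembles gradient checkpointing applied only to the cheap mapping computation; a rough sketch under that assumption (not the paper's implementation):

```python
# Rough sketch of selective recomputation (not DeepSeek's code): the cheap
# mHC mapping is checkpointed, so its intermediates are discarded after the
# forward pass and recomputed during backward, while the expensive layer
# function F keeps its activations and is never re-executed.
import torch
from torch.utils.checkpoint import checkpoint

class MHCBlock(torch.nn.Module):
    def __init__(self, n: int, d: int):
        super().__init__()
        self.F = torch.nn.Linear(d, d)              # stand-in for the heavy layer function
        self.proj = torch.nn.Linear(n * d, n * n)   # cheap mapping parameters

    def mapping(self, X):                           # cheap; safe to recompute in backward
        m = torch.exp(self.proj(X.reshape(-1)).view(X.shape[0], -1))
        for _ in range(20):                         # Sinkhorn-Knopp normalization
            m = m / m.sum(dim=-1, keepdim=True)
            m = m / m.sum(dim=-2, keepdim=True)
        return m

    def forward(self, X):                           # X: (n, d)
        H = checkpoint(self.mapping, X, use_reentrant=False)
        return H @ X + self.F(X)

block = MHCBlock(n=4, d=64)
out = block(torch.randn(4, 64, requires_grad=True))
out.sum().backward()                                # mapping() is re-run here; F is not
```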

In large-scale training, pipeline parallelism is used to manage parameter and gradient memory. DeepSeek extended the DualPipe scheduling strategy to overlap inter-node communication traffic more efficiently. To mitigate the communication latency introduced by the n-stream residual structure and the cost of recomputing the mHC kernel at stage boundaries, the MLP layer's kernel was executed on an independent high-priority computational stream. In addition, long-running persistent kernels were avoided in the attention layer to prevent stalls, allowing more flexible scheduling and high utilization of processing units. The recomputation process is decoupled from pipeline communication dependencies because initial activations are cached locally.
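
As a rough analogue of the stream scheduling described above (the actual system relies on custom kernels and DualPipe scheduling rather than this), PyTorch exposes prioritized CUDA streams:

```python
# Hedged sketch of the stream idea: one component's work is issued on a
# higher-priority CUDA stream than the rest, then synchronized before use.
import torch

if torch.cuda.is_available():
    hi_prio = torch.cuda.Stream(priority=-1)    # lower number = higher priority
    default = torch.cuda.current_stream()

    x = torch.randn(1024, 1024, device="cuda")
    w_mlp = torch.randn(1024, 1024, device="cuda")

    with torch.cuda.stream(hi_prio):
        mlp_out = x @ w_mlp                     # MLP-like work on the priority stream

    other = torch.relu(x)                       # remaining work on the default stream
    default.wait_stream(hi_prio)                # order the streams before consuming
    y = mlp_out + other
```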

Close-up of a high-performance server rack in a data center, with glowing lights.

Experimental Outcomes

DeepSeek's team evaluated the training stability and convergence of the 27B model. Results indicate that mHC mitigated training instability observed in HC, reducing loss by 0.021 compared to the baseline. Gradient norm analysis further confirmed improved stability, showing mHC to be comparable to the baseline and significantly more stable than HC.

Across various benchmarks, mHC reportedly improved downstream performance, outperforming the baseline on all tasks and HC on most. The architecture enhanced the model's reasoning capabilities, achieving a 2.1% performance improvement on BBH and a 2.3% improvement on DROP.

The scalability of mHC was also assessed, demonstrating that its performance advantage is maintained even with higher computational budgets, with only a slight decay. Internal large-scale training experiments further supported these findings.

Analysis of propagation stability showed that while the Sinkhorn-Knopp algorithm's limited iterations (20 in experiments) caused a slight deviation from the ideal doubly stochastic constraint, the backward gradient gain remained bounded, with a maximum value of approximately 1.6 for composite mappings. This contrasts with HC, where the maximum gain reached nearly 3000, indicating that mHC significantly enhances propagation stability by reducing the maximum gain by three orders of magnitude.