China seems to have found a clever workaround to the limitations imposed by NVIDIA’s “cut-down” AI accelerators. DeepSeek’s latest breakthrough reportedly delivers up to an eightfold increase in effective TFLOPS on NVIDIA’s Hopper H800 AI accelerators.
DeepSeek is making a bold statement in China’s tech landscape by using smart software to wring more out of the hardware it already has, rather than waiting around for outside help. The company has worked out how to extract significantly more performance from NVIDIA’s export-adjusted Hopper H800 GPUs by fine-tuning memory usage and the way resources are distributed across different inference tasks, and the results are grabbing attention across the industry.
Recently, on the first day of its much-anticipated “Open Source Week,” DeepSeek rolled out FlashMLA, an efficient Multi-head Latent Attention (MLA) decoding kernel tailored for NVIDIA’s Hopper GPUs, with the code published on GitHub as part of an effort to make the company’s technology widely accessible. FlashMLA’s debut has been met with considerable interest.
According to DeepSeek, FlashMLA reaches 580 TFLOPS for BF16 matrix multiplication on the Hopper H800 in compute-bound workloads, which the company pitches as roughly eight times the industry norm. On the memory side, FlashMLA is equally impressive, sustaining up to 3,000 GB/s in memory-bound workloads, close to the H800’s theoretical peak bandwidth. What’s truly notable is that these gains come purely from software, without any physical hardware upgrades.
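For readers wondering where headline figures like “580 TFLOPS” and “3,000 GB/s” come from, the sketch below shows a generic way to measure achieved BF16 throughput and memory bandwidth with PyTorch on a CUDA GPU. It is not DeepSeek’s benchmark, and the matrix and buffer sizes are arbitrary; it only illustrates the kind of measurement behind such numbers.

```python
# A rough, generic sketch of how achieved BF16 throughput and memory bandwidth
# are commonly measured with PyTorch on a CUDA GPU. This is NOT DeepSeek's
# benchmark; the sizes are arbitrary and only illustrate where headline figures
# like "580 TFLOPS" and "3,000 GB/s" come from.
import torch

def measure_bf16_tflops(n: int = 8192, iters: int = 50) -> float:
    a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    a @ b                                   # warm-up so timing excludes startup cost
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0
    flops = 2 * n**3 * iters                # 2*N^3 FLOPs per N x N matmul
    return flops / seconds / 1e12           # achieved TFLOPS

def measure_bandwidth_gbs(numel: int = 1 << 28, iters: int = 50) -> float:
    src = torch.randn(numel, device="cuda", dtype=torch.bfloat16)
    dst = torch.empty_like(src)
    dst.copy_(src)                          # warm-up
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dst.copy_(src)                      # each copy reads src and writes dst
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0
    bytes_moved = 2 * src.numel() * src.element_size() * iters
    return bytes_moved / seconds / 1e9      # achieved GB/s

if __name__ == "__main__":
    print(f"~{measure_bf16_tflops():.0f} TFLOPS BF16, ~{measure_bandwidth_gbs():.0f} GB/s")
```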
One of FlashMLA’s key innovations is “low-rank key-value compression.” In simpler terms, instead of caching full-size attention keys and values, the kernel stores a compact, lower-dimensional representation of them, which is reported to cut memory usage by 40% to 60% while keeping decoding fast. On top of that, a block-based paging system hands out cache memory in fixed-size blocks as the workload demands, rather than relying on a static allocation, which lets models with variable-length sequences be processed far more efficiently and lifts overall performance.
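To make those two ideas concrete, here is a minimal, illustrative sketch rather than DeepSeek’s actual implementation: part (1) caches a small latent projection instead of full-size keys and values, and part (2) hands out cache memory in fixed-size blocks via a block table. All sizes and names (D_LATENT, PagedKVCache, and so on) are invented for the example.

```python
# Illustrative only: (1) low-rank KV compression caches a small latent per token
# and reconstructs approximate keys/values on demand; (2) block-based paging
# allocates cache memory in fixed-size blocks as a sequence grows, instead of
# reserving a static maximum-length buffer up front.
import torch

D_MODEL, D_LATENT, BLOCK_SIZE = 4096, 512, 64   # invented sizes for illustration

# (1) Low-rank compression: one shared down-projection at write time,
#     separate up-projections recover K and V when attention needs them.
w_down = torch.randn(D_MODEL, D_LATENT) / D_MODEL**0.5
w_up_k = torch.randn(D_LATENT, D_MODEL) / D_LATENT**0.5
w_up_v = torch.randn(D_LATENT, D_MODEL) / D_LATENT**0.5

def compress(hidden: torch.Tensor) -> torch.Tensor:
    return hidden @ w_down                       # (seq, D_LATENT): the only thing cached

def expand(latent: torch.Tensor):
    return latent @ w_up_k, latent @ w_up_v      # approximate K and V, rebuilt on the fly

# (2) Block-based paging: each sequence owns a list of fixed-size blocks,
#     so a short prompt never ties up a full maximum-length buffer.
class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.storage = torch.zeros(num_blocks, BLOCK_SIZE, D_LATENT)
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                   # seq_id -> list of block indices
        self.lengths = {}                        # seq_id -> tokens written so far

    def append(self, seq_id: int, latents: torch.Tensor) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        for row in latents:                      # latents: (n_new_tokens, D_LATENT)
            if length % BLOCK_SIZE == 0:         # current block is full: grab a fresh one
                table.append(self.free_blocks.pop())
            self.storage[table[-1], length % BLOCK_SIZE] = row
            length += 1
        self.lengths[seq_id] = length

# Usage: 130 tokens occupy just 3 blocks instead of a preallocated max-length buffer,
# and only D_LATENT numbers per token are stored instead of 2 * D_MODEL.
cache = PagedKVCache(num_blocks=1024)
cache.append(seq_id=0, latents=compress(torch.randn(130, D_MODEL)))
k, v = expand(cache.storage[cache.block_tables[0][0]])   # rebuild K/V for the first block
```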
DeepSeek’s work with FlashMLA underscores the multifaceted nature of AI computing: performance isn’t about any single component, but about many factors working in harmony. For now, FlashMLA is designed specifically for Hopper GPUs and has been demonstrated on the H800, and there’s plenty of anticipation about how it might perform on the full-strength H100. It’s an exciting time to be watching these developments unfold.