TrendForce News operates independently from our research team, curating key semiconductor and tech updates to support timely, informed decisions.
While local media spotlight Huawei’s push to cut China’s HBM dependency for AI inference, the tech giant made waves on August 12th with the launch of UCM (Unified Computing Memory), an AI inference breakthrough that slashes latency and costs while turbocharging efficiency, according to mydrivers and Securities Times.
Notably, the reports suggest Huawei will open-source UCM in September 2025, launching it first in the MagicEngine community before contributing it to mainstream inference engines and sharing it with Share Everything storage vendors and ecosystem partners.
UCM’s Game-Changing Features
Securities Times, citing Jason Cao, CEO of Huawei Digital Finance, notes that high latency and high costs remain the primary challenges facing AI inference development today. As the report points out, leading international models currently achieve single-user output speeds of 200 tokens per second (5ms latency), while China’s models typically fall below 60 tokens per second (50-100ms latency).
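For context, per-token latency is simply the inverse of single-user throughput; the quick calculation below (ours, not from the reports) shows how the cited figures relate.

```python
def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average per-token latency implied by a single-user output speed."""
    return 1000.0 / tokens_per_second

print(per_token_latency_ms(200))  # 5.0 ms per token, matching the cited figure
print(per_token_latency_ms(60))   # ~16.7 ms per token; the 50-100 ms range cited
                                  # for Chinese models may refer to a different
                                  # measurement point, such as first-token delay
```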
As per the reports, Huawei describes UCM as an AI inference acceleration toolkit centered on KV (Key-Value) Cache technology. The system is said to combine multiple cache optimization algorithms to intelligently manage the KV Cache memory data produced during AI processing. This approach expands inference context windows and achieves high-throughput, low-latency performance while lowering per-token inference costs, the reports add.
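Huawei has not published UCM’s internals, but the general idea of a KV Cache manager can be sketched roughly as follows. This is a minimal illustration using a simple least-recently-used eviction policy; all names and data structures are hypothetical, not Huawei’s.

```python
from collections import OrderedDict

class KVCacheManager:
    """Toy KV cache: keeps the most recently used entries, evicts the rest.

    Real systems store key/value tensors per attention layer; plain lists
    stand in for tensors here to keep the sketch dependency-free.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._cache: OrderedDict[str, list] = OrderedDict()

    def put(self, request_id: str, kv_blocks: list) -> None:
        self._cache[request_id] = kv_blocks
        self._cache.move_to_end(request_id)
        while len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry

    def get(self, request_id: str):
        if request_id in self._cache:
            self._cache.move_to_end(request_id)  # mark as recently used
            return self._cache[request_id]
        return None
```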
Securities Times reports that UCM automatically distributes cached data across HBM, DRAM, and SSD storage based on memory heat patterns. By combining multiple sparse attention algorithms, the system reportedly optimizes computing and storage coordination, delivering 2-22x higher TPS (tokens per second) in long-sequence scenarios while cutting per-token costs.
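Again as a rough sketch rather than Huawei’s actual algorithm, heat-based tiering can be pictured as placing the most frequently accessed cache blocks in HBM, spilling colder blocks to DRAM and then SSD. Tier capacities and names below are illustrative only.

```python
from dataclasses import dataclass

# Tiers ordered fastest to slowest; block capacities are purely illustrative.
TIERS = [("HBM", 8), ("DRAM", 64), ("SSD", 4096)]

@dataclass
class CacheBlock:
    block_id: str
    hits: int = 0  # access count stands in for the "memory heat" signal

def assign_tiers(blocks: list[CacheBlock]) -> dict[str, str]:
    """Place the hottest blocks in the fastest tier, spilling colder ones down."""
    ranked = sorted(blocks, key=lambda b: b.hits, reverse=True)
    placement: dict[str, str] = {}
    start = 0
    for tier_name, capacity in TIERS:
        for block in ranked[start:start + capacity]:
            placement[block.block_id] = tier_name
        start += capacity
    return placement
```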
Meanwhile, Huawei officials cited by the report explain that in multi-turn conversations and knowledge search applications, the system directly accesses previously stored data instead of recalculating everything, reducing initial response delays by up to 90%.
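The mechanism described here resembles prefix or conversation caching: if the KV state for an earlier part of a dialogue is already stored, only the newly appended turn needs fresh computation. The simplified sketch below uses hypothetical names and a placeholder KV state, and is not UCM’s actual implementation.

```python
import hashlib

class ConversationCache:
    """Maps a conversation prefix to its stored KV state so a follow-up
    turn only needs to process the newly added tokens."""

    def __init__(self) -> None:
        self._store: dict[str, object] = {}

    @staticmethod
    def _key(prefix: str) -> str:
        return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

    def lookup(self, prefix: str):
        return self._store.get(self._key(prefix))

    def save(self, prefix: str, kv_state: object) -> None:
        self._store[self._key(prefix)] = kv_state

cache = ConversationCache()
history = "User: What is HBM?\nAssistant: High Bandwidth Memory ..."
if cache.lookup(history) is None:
    # Cache miss: the whole history must be prefilled (the expensive path).
    cache.save(history, kv_state="<kv tensors for the full history>")
# Next turn: the stored state is reused, so only the new question is processed.
reused = cache.lookup(history)
```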
Less HBM Dependency
As per EETimes China, Huawei’s new technology not only boosts AI inference efficiency but could also reduce reliance on HBM memory, enhancing domestic AI large-model inference performance and strengthening China’s AI inference ecosystem.
EETimes China notes that starting January 2, 2025, the U.S. bans exports of HBM2E and higher-grade HBM chips to China. This ban covers not only HBM chips made in the U.S. but also those produced overseas using American technology.
Huawei’s breakthroughs in AI inference are not new. According to the report, the company has achieved multiple milestones, including the DeepSeek open-source inference solution developed with Peking University and several performance improvements on its Ascend platform. Additionally, Huawei’s partnership with iFlytek has delivered notable results, enabling large-scale expert distribution for MoE (Mixture of Experts) models on domestic computing infrastructure, tripling inference speed while cutting response delays by half, the report adds.
(Photo credit: Huawei)