DeepSeek says it has found a way to make AI 85 per cent faster, flagship chip not required
Chinese AI startup DeepSeek says it has found a way to make AI models faster without needing flagship AI chips. The startup has unveiled DSpark, a new framework that can potentially speed up responses of its flagship models by as much as 85 per cent. Here are the details.
by Armaan Agarwal · India Today
In Short
- DeepSeek unveils DSpark framework that can make AI respond faster
- Company says AI can get up to 85 per cent faster, without needing new flagship chips
- DeepSeek and other Chinese companies face US-imposed restrictions on flagship chips
As demand for AI models continues to grow, companies are facing a problem: computing resources. Data centres need thousands of the most advanced GPUs to run these models. And for Chinese AI companies like DeepSeek, advanced AI chips from the likes of Nvidia are not accessible due to US-imposed restrictions. But now, the Chinese startup claims that it has found a way to make its AI models respond much faster without needing the most advanced chips to do so.
DeepSeek has unveiled DSpark – a speculative decoding framework for its flagship V4 model family. DeepSeek says that this can speed up AI responses by as much as 85 per cent. For instance, a single GPU that previously handled 100 user queries could now process about 185.
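The arithmetic behind that example can be checked with a short sketch. The function name below is hypothetical, and the sketch assumes the 85 per cent figure is read as a throughput gain, which is how the 100-to-185 example presents it.

```python
def queries_after_speedup(baseline_queries: int, speedup_percent: int) -> float:
    """Scale a baseline query count by a percentage throughput gain.

    Hypothetical helper for illustration; it treats the quoted
    "85 per cent faster" as 85 per cent more queries per GPU.
    """
    return baseline_queries * (100 + speedup_percent) / 100


# The article's example: 100 queries with an 85 per cent gain.
print(queries_after_speedup(100, 85))  # 185.0
```

The 85 per cent figure is an upper bound quoted by the company, so real workloads may see smaller gains.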
Smaller AI model does the work, larger one verifies
DeepSeek says DSpark is aimed at speeding up AI inference, the process by which an AI model generates a response to your query. AI inference is often seen as a major bottleneck in serving AI models.
AI models generate text one token at a time, which becomes slow and uses GPUs inefficiently when responses are long. Tokens are the basic units of text that AI models process, and usage is measured in them. The more work you do, the more tokens you consume.
DSpark addresses this with speculative decoding. According to the company, a lightweight draft model quickly proposes responses, and then the main model verifies them in batches rather than generating everything from scratch.
That is, a smaller model does the drafting, and the larger one verifies the result. If the draft created by the smaller model is correct, the system skips ahead, and if it is not, it falls back to the main model. DeepSeek says that most tokens are easy to predict, so the system can often move ahead.
DeepSeek claims that this allows output to be generated faster. All of this happens on the GPU with no work being shifted to the CPU.
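The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration of generic greedy speculative decoding, not DeepSeek's DSpark code: the two "models" are hypothetical lookup functions, and every name here is an assumption made for the example. In a real system the verification of all drafted tokens would run as one batched pass of the main model, which the sketch only counts.

```python
from typing import Callable, List, Tuple

Token = str
NextToken = Callable[[List[Token]], Token]

# Toy "models" (hypothetical): the target recites a fixed text, and the
# draft agrees with it everywhere except on one word.
TEXT = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]


def target_model(context: List[Token]) -> Token:
    return TEXT[len(context) % len(TEXT)]


def draft_model(context: List[Token]) -> Token:
    token = target_model(context)
    return "cat" if token == "fox" else token  # deliberate draft mistake


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Return the generated tokens and the number of verification rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens in a row.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks each proposed position. A real system does
        #    this in a single batched pass; here we only count the round.
        rounds += 1
        accepted: List[Token] = []
        for token in proposal:
            expected = target(out + accepted)
            if token == expected:
                accepted.append(token)      # draft was right: skip ahead
            else:
                accepted.append(expected)   # draft was wrong: fall back
                break
        out.extend(accepted)

    start = len(prompt)
    return out[start:start + max_new_tokens], rounds


tokens, rounds = speculative_decode(target_model, draft_model, [], len(TEXT))
print(" ".join(tokens))
print("verification rounds:", rounds)
```

In this toy run the output matches what the target model would produce on its own, but the target is consulted in 3 verification rounds instead of 9 single-token steps. The saving depends entirely on how often the draft is right, which is why the claim that most tokens are easy to predict matters.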
The framework also uses a semi-autoregressive generation method. That is, instead of generating responses token by token, it can produce small chunks of tokens at a time, making the process quicker.
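A rough way to see why chunking helps is to count sequential generation steps. The sketch below is an illustration under stated assumptions, not DSpark's method: the function names and the fixed chunk size are hypothetical, and token positions stand in for real tokens.

```python
from typing import Iterator, List


def generate_in_chunks(tokens_needed: int, chunk_size: int) -> Iterator[List[int]]:
    """Yield token positions in chunks; each yield is one sequential step.

    Hypothetical stand-in for a generator that emits several tokens per step.
    """
    position = 0
    while position < tokens_needed:
        end = min(position + chunk_size, tokens_needed)
        yield list(range(position, end))
        position = end


def sequential_steps(tokens_needed: int, chunk_size: int) -> int:
    """Count how many sequential steps a response of this length needs."""
    return sum(1 for _ in generate_in_chunks(tokens_needed, chunk_size))


print(sequential_steps(12, 1))  # token by token: 12 steps
print(sequential_steps(12, 4))  # chunks of four: 3 steps
```

Fewer sequential steps only translate into faster responses if each chunk costs roughly as much to produce as a single token, which is the assumption this sketch leaves unexamined.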
More efficient way of using AI?
The company has open-sourced its DSpark research, a joint effort with Peking University, on GitHub and HuggingFace. DeepSeek says that DSpark does not improve a model's general capabilities, but it could help companies get better performance without large additional investment in computing resources.
The company tested the framework on several open-source models, including Google DeepMind's Gemma and Alibaba's Qwen, suggesting the gains could be applied more widely.
This could be crucial at a time when AI companies are spending billions of dollars to acquire more compute for data centres. At the same time, companies like Uber and Walmart are restricting AI token usage for employees due to rising costs.
In April this year, DeepSeek open-sourced V4 Preview, positioning the model family as a cost-effective option for handling 1 million-context inputs. DeepSeek said V4-Pro is aimed at higher performance, while V4-Flash is designed as a faster and cheaper option.
Do note that DeepSeek is not the only company working on improving AI response speed. Earlier this month, Xiaomi's AI team said it had raised the output speed of its MiMo-V2.5-Pro-UltraSpeed model to more than 1,000 tokens per second, among the fastest speeds in the industry.