Inference is giving AI chip startups a second chance to make their mark
In a disaggregated AI world, Nvidia can be both a friend and an enemy
by Tobias Mann · The Register

AI adoption is reaching an inflection point as the focus shifts from training new models to serving them. For the AI startups vying for a slice of Nvidia's pie, it's now or never.
Compared to training, inference is a much more diverse workload, which presents an opportunity for chip startups to carve out a niche for themselves. Large-batch inference requires a different mix of compute, memory, and bandwidth than an AI assistant or code agent does.
Because of this, inference has become increasingly heterogeneous, with some aspects better suited to GPUs and others to more specialized hardware.
Nvidia's $20 billion acquihire of Groq back in December is a prime example. The startup's SRAM-heavy architecture meant that, with enough of its LPUs, Groq could churn out tokens faster than any GPU. However, the chips' limited compute capacity and aging tech meant they couldn't scale all that efficiently.
Nvidia sidestepped this problem by moving the compute-heavy prefill portion of the inference pipeline to its GPUs while keeping the bandwidth-constrained decode operations on its shiny new LPUs.
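For those who haven't kept up, the split works roughly like this: prefill chews through the entire prompt in one parallel, compute-hungry pass and spits out a key-value (KV) cache, which the decode phase then re-reads for every token it generates. Here's a toy sketch of the hand-off in Python (the names, shapes, and single attention layer are ours for illustration, not any vendor's serving stack):

```python
import numpy as np

# Toy single-layer attention to show the prefill/decode hand-off.
# All names, shapes, and weights are illustrative, not a real serving API.

D = 64                                           # model width
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(0, 0.02, (D, D)) for _ in range(3))

def prefill(prompt):
    """Compute-bound: one big parallel pass over the whole prompt
    builds the KV cache. This is the part kept on GPUs."""
    return prompt @ Wk, prompt @ Wv              # KV cache, shipped to decoder

def decode_step(x, kv):
    """Bandwidth-bound: each new token appends to, then re-reads,
    the entire cache. This is the part offloaded to LPUs and friends."""
    k, v = kv
    k = np.vstack([k, x @ Wk])
    v = np.vstack([v, x @ Wv])
    scores = np.exp((x @ Wq) @ k.T)
    out = (scores / scores.sum()) @ v            # context for the next token
    return out, (k, v)

prompt = rng.normal(size=(128, D))               # 128-token prompt
kv = prefill(prompt)                             # runs on the prefill device
out, kv = decode_step(prompt[-1], kv)            # runs on the decode device
```

The point of the disaggregation is that the only thing crossing the wire between the two devices is that KV cache, so each side can run on whatever silicon suits its bottleneck.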
This combination isn't unique to Nvidia. The week after GTC, AWS announced a disaggregated compute platform of its own that used its custom Trainium accelerators for prefill and Cerebras Systems' dinner-plate-sized wafer-scale accelerators for decode.
Even Intel has gotten in on the fun, announcing a reference design that'll use GPUs — presumably the one it teased last northern hemisphere fall — for prefill and AI chip startup SambaNova's new RDUs for decode.
So far, most of the AI chip startups' wins have been on the decode side of the equation. SRAM, while not particularly capacious, is stupendously fast. So with enough chips (or one big enough chip, in Cerebras' case), they're well suited to accelerating decode operations. But chip startups aren't limited to this regime.
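The reason decode favors fast memory is simple arithmetic: at a batch size of one, generating each token means re-reading more or less all of the model's weights, so memory bandwidth, not raw compute, sets the ceiling. A back-of-the-envelope sketch, using rough published bandwidth specs rather than anything the vendors have benchmarked:

```python
# Rough tokens/sec ceiling for single-user decode: bandwidth / weight bytes.
# Figures below are approximate public specs, not measured throughput.

params = 70e9            # a 70B-parameter model, e.g. Llama 3.1 70B
bytes_per_param = 1      # FP8 weights

hbm_bw = 3.35e12         # ~3.35 TB/s, an H100's HBM3 bandwidth
sram_bw = 80e12          # ~80 TB/s, the on-die SRAM figure Groq has cited

weights = params * bytes_per_param
print(f"HBM ceiling:  ~{hbm_bw / weights:5.0f} tokens/s per user")
print(f"SRAM ceiling: ~{sram_bw / weights:5.0f} tokens/s per user")

# Caveat: a single Groq LPU carries roughly 230 MB of SRAM, so hitting
# that ceiling means sharding the model across hundreds of chips.
```

That caveat is the "not particularly capacious" problem in a nutshell, and it's exactly why the compute-heavy prefill phase gets handed off to something else.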
This week, Lumai detailed its optical inference accelerator, which uses light rather than electrons to perform the matrix multiplications at the heart of most machine learning workloads, at a fraction of the power of a purely digital architecture.
Lumai expects its next-gen Iris Tetra systems will achieve an exaOPS of AI performance in a 10kW power budget by 2029.
Technically, the chips use a hybrid electro-optical architecture, but the bulk of the compute done during inference is handled by the chip's optical tensor core.
Initially, the company is positioning the chip as a standalone alternative to GPUs for compute-bound inference workloads, such as batch processing. Longer term, it also plans to use its optical accelerators as prefill processors.
The architecture is still in its infancy, capable of running billion-parameter models like Llama 3.1 8B or 70B today, but it's far enough along that the UK-based startup has opened its chips up to neoclouds and hyperscalers for evaluation.
Having said that, not every AI chip startup is keen on using different chips for prefill and decode. Earlier this week, Tenstorrent unveiled its RISC-V-based Galaxy Blackhole compute platforms, and suffice it to say the company's CEO Jim Keller isn't a fan of the disaggregated inference formula.
"Every company in the industry is pairing up to build the accelerator accelerator accelerator. CPUs run code. GPUs accelerate CPUs. TPUs accelerate GPUs. LPUs accelerate TPUs. And so on. This leads to complex solutions which are unlikely to be compatible with changes in AI models and uses. At Tenstorrent, we thought something more general and simpler would work," he said in a statement. ®