Nvidia fills the void of American open-weights models with some of its own
Nemotron 3 is a grab bag of 2025's top machine learning advancements
by Tobias Mann · The Register

For many, enterprise AI adoption depends on the availability of high-quality open-weights models. Exposing sensitive customer data or hard-fought intellectual property to APIs so you can use closed models like ChatGPT is a non-starter.
Outside of Chinese AI labs, the few open-weights models available today don't compare favorably to the proprietary models from the likes of OpenAI or Anthropic.
This isn't just a problem for enterprise adoption; it's a roadblock to Nvidia's agentic AI vision that the GPU giant is keen to clear. On Monday, the company added three new open-weights models of its own design to its arsenal.
Open-weights models are nothing new for Nvidia — most of the company's headcount is composed of software engineers. However, its latest generation of Nemotron LLMs is by far its most capable and open.
When they launch, the models will be available in three sizes: Nano, Super, and Ultra, weighing in at about 30, 100, and 500 billion parameters, respectively.
The model weights will roll out on popular AI repos like Hugging Face over the next few months, beginning with Nemotron 3 Nano this week. Alongside them, Nvidia has committed to releasing the training data and reinforcement learning environments used to create the models, opening the door to highly customized versions down the line.
The models also employ a novel "hybrid latent MoE" architecture designed to minimize performance losses when processing long input sequences, such as large documents and the queries run against them.
This is achieved by interleaving Mamba-2 and Transformer layers throughout the model. Mamba-2 is generally more efficient than transformers when processing long sequences, which results in shorter prompt processing times and more consistent token generation rates.
Nvidia says it retains transformer layers to maintain "precise reasoning" and keep the model from losing sight of relevant information, a known challenge when ingesting long documents or tracking details over extended chat sessions.
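To make the idea concrete, here's a rough sketch of what such a hybrid layer stack could look like in PyTorch. The layer ratio, dimensions, and the MambaStandIn block are all illustrative assumptions, since Nvidia hasn't published the exact recipe; the stand-in is not a real Mamba-2 implementation, just a linear-time placeholder.

```python
# Sketch of a hybrid layer stack: mostly linear-time blocks, with the
# occasional attention layer for precise long-range recall. The ratio
# and MambaStandIn internals are assumptions, not Nvidia's recipe.
import torch
import torch.nn as nn

class MambaStandIn(nn.Module):
    """Placeholder for a Mamba-2 block: its cost grows linearly with
    sequence length, unlike attention's quadratic cost."""
    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, seq, dim)
        h, gate = self.in_proj(x).chunk(2, dim=-1)
        # Toy linear-time state mixing: a running mean over the sequence
        h = torch.cumsum(h, dim=1) / torch.arange(
            1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        return x + self.out_proj(h * torch.sigmoid(gate))

class AttentionBlock(nn.Module):
    """Standard self-attention block, kept for precise recall."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

def build_hybrid_stack(dim=512, depth=12, attn_every=4):
    # Assumption: one attention layer per `attn_every` layers; the rest
    # are linear-time blocks.
    return nn.Sequential(*[
        AttentionBlock(dim) if (i + 1) % attn_every == 0 else MambaStandIn(dim)
        for i in range(depth)
    ])

model = build_hybrid_stack()
tokens = torch.randn(1, 1024, 512)              # (batch, seq_len, dim)
print(model(tokens).shape)                      # torch.Size([1, 1024, 512])
```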
Speaking of which, these models natively support a million-token context window, the equivalent of roughly 3,000 double-spaced pages of text.
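That page figure checks out as back-of-the-envelope arithmetic, assuming the usual rules of thumb of roughly 0.75 words per token and 250 words per double-spaced page:

```python
# Quick sanity check on the page estimate. Both conversion factors are
# rules of thumb, not exact figures.
tokens = 1_000_000
words = tokens * 0.75       # ~750,000 words
pages = words / 250         # ~3,000 double-spaced pages
print(round(pages))         # 3000
```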
All of these models employ a mixture-of-experts (MoE) architecture, which means only a fraction of the total parameter count is activated for each token processed and generated. This puts less pressure on the memory subsystem, resulting in faster throughput than an equivalent dense model on the same hardware.
For example, Nemotron 3 Nano has 30 billion parameters but only 3 billion are activated for each token generated.
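A toy top-k router makes the mechanism clear: each token is sent to a small subset of expert networks, so most of the layer's weights sit idle for any given token. The 16-expert, top-2 configuration below is illustrative, chosen to roughly mirror Nano's ~10 percent activation ratio; it isn't Nemotron's actual layout.

```python
# Toy mixture-of-experts layer: the router picks top-k experts per
# token, so only k/num_experts of the FFN weights do work per token.
# Expert count and k are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, dim=256, num_experts=16, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)])
        self.k = k

    def forward(self, x):                           # x: (tokens, dim)
        weights, idx = F.softmax(self.router(x), dim=-1).topk(self.k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                  # only k of 16 experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = ToyMoE()
print(moe(torch.randn(8, 256)).shape)               # torch.Size([8, 256])
```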
While the Nano model employs a fairly standard MoE architecture, not unlike those seen in gpt-oss or Qwen3-30B-A3B, the larger Super and Ultra models were pretrained using Nvidia's NVFP4 data type and use a new latent MoE architecture.
As Nvidia explains it, using this approach, "experts operate on a shared latent representation before outputs are projected back to token space. This approach allows the model to call on 4x more experts at the same inference cost, enabling better specialization around subtle semantic structures, domain abstractions, or multi-hop reasoning patterns."
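Nvidia hasn't published the dimensions, but the 4x figure falls out naturally from the geometry: if the shared latent space is half the width of the token space, a square expert matrix costs a quarter as much to apply, so four times as many experts fit in the same compute budget. A back-of-the-envelope check with made-up widths:

```python
# Hypothetical widths; the real Nemotron dimensions are not public.
dim, latent = 1024, 512
cost_token_expert = dim * dim           # cost of one square expert in token space
cost_latent_expert = latent * latent    # same expert shape in the smaller latent space
print(cost_token_expert // cost_latent_expert)  # 4 -> 4x the experts at the same cost
```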
Finally, these models have been engineered to use "multi-token prediction," a spin on speculative decoding (a technique we've explored in detail here) that can improve inference performance by up to 3x by predicting several future tokens each time a new one is generated. Speculative decoding is particularly useful in agentic applications, like code assistants, where large quantities of information are repeatedly processed and regenerated.
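The scheme boils down to guess-then-verify: cheap guesses at the next few tokens are checked against the full model in a single batched pass, and every correct guess is one sequential forward pass saved. The sketch below is a toy with exact-match acceptance and made-up stand-in "models"; real implementations compare probabilities, and Nemotron's extra prediction heads play the drafter's role.

```python
# Toy draft-then-verify loop. Everything here is illustrative: real
# systems accept or reject based on token probabilities, not equality.
def generate(verify, propose, context, steps=16, k=4):
    """Each round, `propose` guesses k tokens cheaply and `verify`
    checks them in one batched pass; each accepted guess saves a full
    sequential forward pass of the big model."""
    while len(context) < steps:
        guesses = propose(context, k)
        truth = verify(context, k)      # what the big model would emit
        for g, t in zip(guesses, truth):
            context.append(t)           # always keep the verified token
            if g != t:                  # first mismatch: discard the rest
                break
    return context

# Stand-in "models": the target counts upward; the drafter gets 3 of 4 right.
verify = lambda ctx, k: [ctx[-1] + i + 1 for i in range(k)]
propose = lambda ctx, k: [ctx[-1] + i + 1 + (i == 3) for i in range(k)]
print(generate(verify, propose, [0]))   # [0, 1, 2, ..., 16]
```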
Nvidia's 30-billion-parameter Nemotron 3 Nano is available this week, and is designed to run efficiently on enterprise hardware like the vendor's L40S or RTX Pro 6000 Server Edition. However, using 4-bit quantized versions of the model, it should be possible to cram it into GPUs with as little as 24GB of video memory.
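The arithmetic behind that 24GB claim is straightforward, with the caveat that the KV cache, activations, and quantization scale factors all eat into the margin:

```python
# Rough weight-memory math for a 30B-parameter model at different
# precisions. Overheads (KV cache, activations, scale factors) are
# hand-waved, so treat the headroom as approximate.
params = 30e9
bytes_per_param = {"fp16": 2, "fp8": 1, "int4/nvfp4": 0.5}
for fmt, b in bytes_per_param.items():
    print(f"{fmt}: {params * b / 2**30:.0f} GiB of weights")
# fp16: 56 GiB        -> needs a bigger card or multiple GPUs
# fp8: 28 GiB         -> just over a 24GB card
# int4/nvfp4: 14 GiB  -> leaves ~10GB for KV cache and activations
```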
According to Artificial Analysis, the model delivers performance on par with models like gpt-oss-20B or Qwen3 VL 32B and 30B-A3B, while offering enterprises far greater flexibility for customization.
One of the go-to methods for model customization is reinforcement learning (RL), which enables users to teach the model new information or approaches through trial and error: desirable outcomes are rewarded, while undesirable ones are punished. Alongside the new models, Nvidia is releasing RL datasets and training environments, which it calls NeMo Gym, to help enterprises fine-tune the models for their specific applications or agentic workflows.
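At its core, that trial-and-error loop has the shape of a policy-gradient update. The sketch below shows the general idea in REINFORCE style; it is not NeMo Gym's actual API, and `policy.generate_with_logprob`, `sample_prompt`, and `reward_fn` are all hypothetical placeholders.

```python
# REINFORCE-style sketch of RL fine-tuning: reward what works, penalize
# what doesn't. All callables here are hypothetical placeholders, not
# NeMo Gym's API.
import torch

def rl_step(policy, optimizer, sample_prompt, reward_fn):
    prompt = sample_prompt()                                  # draw a task from the environment
    response, logprob = policy.generate_with_logprob(prompt)  # hypothetical helper: output + its log-probability
    reward = reward_fn(prompt, response)                      # e.g. +1 if an agent's tool call succeeds
    loss = -reward * logprob      # positive reward pushes the sampled response's probability up
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```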
Nemotron 3 Super and Ultra are expected to make their debut in the first half of next year. ®