Twelve billion active parameters out of one hundred twenty billion total. A one-million-token context window. Open source. NVIDIA just set a new standard for agentic AI systems with Nemotron 3 Super, a model that tackles the long-running tasks of AI agents without the "thinking tax" that cripples existing solutions.
Read more: How Google's Gemini Surpassed ChatGPT
What Makes Nemotron 3 Super Different?
Work with multi-agent systems and you'll quickly see where it hurts: they generate 15x more tokens than simple chat, constantly re-sending history, tool outputs, and reasoning steps with every turn. This "context explosion" causes goal drift: agents gradually lose alignment with their original objective. Nemotron 3 Super tackles these challenges with a hybrid mixture-of-experts (MoE) architecture that delivers over 5x the throughput of the previous Nemotron Super. Native support for 1 million tokens gives agents the long-term memory needed for aligned, high-accuracy reasoning. Sure, NVIDIA has made bold claims before. How genuinely impressive are these numbers?

Latent MoE: More Experts, Same Cost
A typical MoE architecture routes tokens directly from the model's full hidden dimension to experts. As models grow, this routing layer becomes a bottleneck. Super introduces latent MoE: before any routing decision, token embeddings are projected into a compressed, low-rank latent space. Expert computation happens in this smaller dimension, and the results are projected back up to the full model dimension.

The practical result: 4x more experts for the same computational cost. That means finer-grained specialization, such as activating distinct experts for Python syntax versus SQL logic, triggered only when actually needed.
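The down-project, route, compute, up-project flow can be sketched in a few lines of NumPy. Everything here is an illustrative assumption (toy sizes, top-2 routing, ReLU experts), not NVIDIA's actual configuration:

```python
import numpy as np

# Toy latent-MoE sketch. All sizes and the top-2 routing are assumptions.
D_MODEL, D_LATENT, N_EXPERTS, TOP_K = 64, 16, 8, 2

rng = np.random.default_rng(0)
W_down = rng.standard_normal((D_MODEL, D_LATENT)) * 0.1   # compress: full dim -> latent
W_up = rng.standard_normal((D_LATENT, D_MODEL)) * 0.1     # expand: latent -> full dim
W_router = rng.standard_normal((D_LATENT, N_EXPERTS)) * 0.1
# Every expert lives entirely in the cheap latent dimension.
experts = [rng.standard_normal((D_LATENT, D_LATENT)) * 0.1 for _ in range(N_EXPERTS)]

def latent_moe(x):
    """x: (tokens, D_MODEL) -> (tokens, D_MODEL)."""
    z = x @ W_down                                    # routing happens in latent space
    logits = z @ W_router
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.zeros_like(z)
    for t in range(z.shape[0]):
        for e in np.argsort(probs[t])[-TOP_K:]:       # only TOP_K experts fire per token
            out[t] += probs[t, e] * np.maximum(z[t] @ experts[e], 0.0)
    return out @ W_up                                 # project results back up

y = latent_moe(rng.standard_normal((4, D_MODEL)))
print(y.shape)  # (4, 64)
```

Because routing and expert math run at `D_LATENT` instead of `D_MODEL`, you can afford many more experts for the same FLOP budget, which is the claimed 4x.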
Hybrid Mamba-Transformer: Best of Both Worlds
The model's backbone interleaves three layer types:

**Mamba-2 layers** handle the majority of sequence processing. State space models (SSMs) offer linear-time complexity with respect to sequence length, which makes the 1M-token context window practical rather than theoretical. When an agent needs to reason over entire codebases, long conversation histories, or stacks of retrieved documents, Mamba layers keep the memory footprint manageable.

**Transformer attention layers** are interleaved at critical depths. Pure SSMs can struggle with precise associative recall, the kind of task where you need to find a specific fact buried in a long context.

**MoE layers** scale the effective parameter count without the cost of dense computation. Only a subset of experts activates per token, keeping latency low and throughput high.

The architecture looks clever in theory. But in practice?

Multi-Token Prediction: Built-in Speed
Standard language models train to predict one token at a time, a fundamentally myopic objective. Super instead trains with Multi-Token Prediction (MTP), in which specialized prediction heads simultaneously predict several future tokens from each position.

3x wall-clock speedup for structured generation tasks
478 output tokens per second (fastest in class)
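A minimal sketch of the MTP training objective: extra heads share one trunk state, and head k is penalized for mispredicting the token k+1 steps ahead. The vocabulary size, hidden width, and K=3 heads are illustrative assumptions, not Nemotron's published configuration:

```python
import numpy as np

# Toy multi-token prediction objective. VOCAB, D, and K=3 are assumptions.
VOCAB, D, K = 50, 8, 3
rng = np.random.default_rng(1)
heads = [rng.standard_normal((D, VOCAB)) * 0.1 for _ in range(K)]

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def mtp_loss(hidden, targets):
    """hidden: (T, D) trunk states; targets[t] is the id of token t+1.
    Sums cross-entropy over all K horizons, skipping positions that
    would run past the end of the sequence."""
    total, count = 0.0, 0
    T = hidden.shape[0]
    for k, W in enumerate(heads):          # head k predicts token t+1+k
        for t in range(T - k):
            p = softmax(hidden[t] @ W)
            total += -np.log(p[targets[t + k]] + 1e-12)
            count += 1
    return total / count

loss = mtp_loss(rng.standard_normal((10, D)), rng.integers(0, VOCAB, size=10))
print(round(loss, 3))
```

At inference time the same heads can draft several tokens per forward pass, which is where the wall-clock speedup for structured output comes from.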
Read more: GPT-5.3-Codex-Spark: AI Coding at 1000+ Tokens/Second
Native NVFP4: Training at 4-bit
Most quantized models start at full precision and are compressed after training, which inevitably introduces accuracy loss. Super takes a different approach: the majority of floating-point multiply-accumulate operations during pretraining run in NVFP4, NVIDIA's 4-bit floating-point format. Optimized for Blackwell, it significantly cuts memory requirements and delivers 4x faster inference on NVIDIA B200 than FP8 on NVIDIA H100, while maintaining accuracy. Training natively at reduced precision means the model learns to be accurate within the constraints of 4-bit arithmetic from the first gradient update.

Benchmarks: How It Stacks Up Against the Competition
On PinchBench, a new benchmark for measuring how well LLMs perform as the brain of an OpenClaw agent, Nemotron 3 Super scores 85.6% across the test suite, making it the best open model in its class.

Intelligence Score
36 on Artificial Analysis Intelligence Index, ahead of gpt-oss-120b (33) but behind Qwen3.5 122B A10B (42)
Openness Rating
83 on Artificial Analysis Openness Index â open weights, datasets, and training recipes
Throughput Champion
11% higher throughput per GPU than gpt-oss-120b, 40% higher than Qwen3.5 122B A10B
Availability and Practical Use
Nemotron 3 Super is available now on build.nvidia.com, Perplexity, OpenRouter, and Hugging Face. Enterprises can also access it through Google Cloud's Vertex AI and Oracle Cloud Infrastructure, with Amazon Bedrock and Microsoft Azure coming soon. For those with the necessary hardware, NVIDIA also ships it as a NIM. The fact that it's currently free on OpenRouter means many will test it against their own workflows.

The company says the model is a good fit for DGX Spark or NVIDIA RTX PRO workstations, whose 128GB of unified memory can host models with over 120 billion parameters.
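Trying it via OpenRouter takes a few lines, since OpenRouter exposes an OpenAI-compatible chat completions endpoint. A minimal stdlib-only sketch; note that the model slug below is an assumption (check the OpenRouter catalog for the real id), and the request is only sent when an `OPENROUTER_API_KEY` environment variable is set:

```python
import json
import os
import urllib.request

# The model slug is a placeholder assumption -- look up the actual id
# in the OpenRouter model catalog before running.
MODEL = "nvidia/nemotron-3-super"

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Summarize latent MoE in one sentence."}],
}

api_key = os.environ.get("OPENROUTER_API_KEY")
if api_key:  # only hit the network when a key is configured
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
else:
    print("Set OPENROUTER_API_KEY to send the request.")
```

Because the endpoint is OpenAI-compatible, the same payload also works with the official `openai` client by pointing its base URL at openrouter.ai.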
NVIDIA Developer Blog
