*Figure: NVIDIA Nemotron 3 Super architecture diagram showing 120 billion parameters with 12 billion active parameters for agentic AI systems*

NVIDIA Nemotron 3 Super: Revolutionary 120B Parameter Open-Source Model Redefines Agentic AI Performance

📅 March 29, 2026 ⏱ 6 min read ✍ GReverse Team
Twelve billion active parameters from a total of one hundred twenty billion. One million token context window. Open source. NVIDIA just set new standards for agentic AI systems with Nemotron 3 Super — a model that tackles the long-running tasks of AI agents without the "thinking tax" that cripples existing solutions.

📖 Read more: How Google's Gemini Surpassed ChatGPT

🔬 What Makes Nemotron 3 Super Different?

Work with multi-agent systems and you'll quickly see where it hurts: they generate 15x more tokens than simple chat, constantly re-sending history, tool outputs, and reasoning steps with every turn. This "context explosion" causes goal drift — agents gradually lose alignment with their original objective. Nemotron 3 Super tackles these challenges with hybrid mixture-of-experts (MoE) architecture that delivers over 5x throughput compared to the previous Nemotron Super. Native support for 1 million tokens gives agents long-term memory for aligned, high-accuracy reasoning. Sure, NVIDIA has made bold claims before. How genuinely impressive are these numbers?
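The "context explosion" is easy to see with back-of-the-envelope arithmetic. A minimal sketch (the token counts below are illustrative assumptions, not NVIDIA's measurements) shows why an agent loop that re-sends its full history each turn processes far more tokens than a single chat exchange:

```python
# Illustrative sketch: an agent loop re-reads its entire accumulated
# context every turn, so total tokens processed grows quadratically.
def tokens_processed(turns, prompt=500, per_turn=800):
    """Total tokens the model reads across an agent loop, assuming the
    full accumulated context is re-sent on every turn."""
    total = 0
    context = prompt
    for _ in range(turns):
        total += context          # model re-reads everything so far
        context += per_turn       # history grows by tool output + reasoning
    return total

single_chat = 500 + 800            # one prompt, one reply
agent_loop = tokens_processed(10)  # a ten-turn agent run
print(agent_loop / single_chat)    # ≈31.5x the single-chat cost
```

With these placeholder numbers, ten agent turns cost roughly 30x a single chat exchange, which is why a large context window and high throughput matter more for agents than for chatbots.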

⚡ Latent MoE: More Experts, Same Cost

Typical MoE architecture routes tokens directly from the model's full hidden dimension to experts. As models grow, this routing layer becomes a bottleneck. Super introduces latent MoE: before routing decisions, token embeddings project into compressed, low-rank latent space. Expert computation happens in this smaller dimension, with results projected back to full model dimension.
Practical result: 4x more experts for the same computational cost. That means finer-grained specialization — activating distinct experts for Python syntax versus SQL logic — triggered only when absolutely necessary.
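The mechanics can be sketched in a few lines of numpy. All dimensions, the top-k value, and the random weights below are illustrative assumptions, not Nemotron's actual configuration; the point is the order of operations: compress, route, run experts in the latent dimension, decompress.

```python
import numpy as np

# Latent-MoE sketch (shapes are illustrative, not Nemotron's):
# tokens are projected into a low-rank latent space, routing and
# expert computation happen there, and results project back up.
rng = np.random.default_rng(0)
d_model, d_latent, n_experts, top_k = 1024, 256, 8, 2

W_down = rng.standard_normal((d_model, d_latent)) * 0.02   # compress
W_up   = rng.standard_normal((d_latent, d_model)) * 0.02   # decompress
experts = [rng.standard_normal((d_latent, d_latent)) * 0.02
           for _ in range(n_experts)]
router  = rng.standard_normal((d_latent, n_experts)) * 0.02

def latent_moe(x):
    z = x @ W_down                       # (tokens, d_latent)
    scores = z @ router                  # routing happens in latent space
    out = np.zeros_like(z)
    for t in range(z.shape[0]):
        top = np.argsort(scores[t])[-top_k:]         # pick top-k experts
        w = np.exp(scores[t][top]); w /= w.sum()     # softmax over winners
        for weight, e in zip(w, top):
            out[t] += weight * (z[t] @ experts[e])   # expert runs in d_latent
    return out @ W_up                    # back to d_model

y = latent_moe(rng.standard_normal((4, d_model)))
print(y.shape)  # (4, 1024)
```

Because each expert's cost scales with the square of the latent dimension, shrinking it (here 4x, from 1024 to 256) leaves a large budget for adding more experts at the same total compute.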

🧬 Hybrid Mamba-Transformer: Best of Both Worlds

The model's backbone interleaves three layer types:

- **Mamba-2 layers** handle the majority of sequence processing. State space models (SSMs) offer linear-time complexity with respect to sequence length, which makes the 1M-token context window practical instead of theoretical. When an agent needs reasoning over entire codebases, long conversation history, or stacks of retrieved documents, Mamba layers keep the memory footprint manageable.
- **Transformer attention layers** are interleaved at critical depths. Pure SSMs can struggle with precise associative recall, the type of task where you need to find a specific fact buried in long context; attention layers fill that gap.
- **MoE layers** scale effective parameter count without dense computation cost. Only a subset of experts activates per token, keeping latency low and throughput high.

The architecture looks clever in theory. But in practice?
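The interleaving pattern itself is just a layer schedule. The sketch below is a hypothetical schedule (NVIDIA has not published the exact ratios used here): blocks of Mamba-2 layers punctuated by an attention layer for recall, with MoE serving as the sparse feed-forward stage.

```python
# Hypothetical layer schedule illustrating the hybrid backbone idea.
# The block count and mamba-to-attention ratio are assumptions, not
# Nemotron 3 Super's published configuration.
def build_schedule(n_blocks=4, mamba_per_block=5):
    layers = []
    for _ in range(n_blocks):
        layers += ["mamba2"] * mamba_per_block  # linear-time sequence mixing
        layers += ["attention"]                 # precise associative recall
        layers += ["moe"]                       # sparse capacity, low latency
    return layers

schedule = build_schedule()
print(schedule.count("mamba2"), schedule.count("attention"), schedule.count("moe"))
# prints: 20 4 4
```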

📊 Multi-Token Prediction: Built-in Speed

Standard language models train to predict one token at a time — a fundamentally myopic objective. Super trains with Multi-Token Prediction (MTP), where specialized prediction heads simultaneously predict several future tokens from each position.
- **3x** wall-clock speedup for structured generation tasks
- **478** output tokens per second (fastest in class)
This has two concrete benefits: stronger reasoning during training, since the model must internalize longer-range structure and logical dependencies. And built-in speculative decoding at inference — predicting multiple future tokens simultaneously in one forward pass.
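The inference-time benefit is easy to picture: instead of one head producing one next-token distribution, k heads each predict a different future offset from the same hidden state. The sketch below uses toy dimensions and random weights (all assumptions, not the model's real heads) to show the shape of the idea.

```python
import numpy as np

# Multi-token prediction sketch (dimensions are illustrative): from one
# hidden state, k separate heads each predict a different future token
# in a single forward pass.
rng = np.random.default_rng(0)
d_model, vocab, k = 512, 1000, 4

heads = [rng.standard_normal((d_model, vocab)) * 0.02 for _ in range(k)]

def mtp_predict(hidden):
    """Return k draft tokens predicted from one position."""
    return [int(np.argmax(hidden @ W)) for W in heads]

# At inference these drafts feed speculative decoding: drafts are kept
# while they match the model's own next-token choices, so several
# tokens can be accepted per forward pass.
drafts = mtp_predict(rng.standard_normal(d_model))
print(len(drafts))  # 4 draft tokens from a single position
```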

📖 Read more: GPT-5.3-Codex-Spark: AI Coding at 1000+ Tokens/Second

🎯 Native NVFP4: Training at 4-bit

Most quantized models start as full-precision and compress after training, inevitably introducing accuracy loss. Super takes a different approach: the majority of floating-point multiply-accumulate operations during pretraining run in NVFP4, NVIDIA's 4-bit floating-point format. Optimized for Blackwell, this significantly cuts memory requirements and accelerates inference by 4x on NVIDIA B200 compared to FP8 on NVIDIA H100, while maintaining accuracy. Training natively at reduced precision means the model learns to be accurate within 4-bit arithmetic constraints from the first gradient update.
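To see why 4-bit arithmetic is constraining, it helps to look at how few values a 4-bit float can represent. The toy quantizer below is in the spirit of NVFP4 (E2M1 magnitudes with a per-block scale) but the block handling and scale format are simplifications, not the exact NVFP4 specification.

```python
import numpy as np

# Toy 4-bit float quantizer: E2M1 representable magnitudes plus a
# per-block scale. Simplified relative to the real NVFP4 format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x):
    scale = max(np.abs(x).max() / FP4_GRID[-1], 1e-12)
    scaled = x / scale
    # snap each magnitude to the nearest representable FP4 value
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx], scale

def dequantize(q, scale):
    return q * scale

x = np.array([0.11, -0.48, 0.92, 1.5, -3.3, 5.9])
q, s = quantize_block(x)
x_hat = dequantize(q, s)
print(np.abs(x - x_hat).max())  # small, bounded by the grid spacing x scale
```

With only eight magnitudes per block, everything hinges on the scale factor, which is why training natively in the format (so the weights learn to live on this grid) avoids the accuracy cliff of post-hoc compression.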

🏆 Benchmarks: How It Stacks Against Competition

On PinchBench — a new benchmark for determining how well LLM models perform as the brain of an OpenClaw agent — Nemotron 3 Super scores 85.6% across the test suite, making it the best open model in its class.

- **Intelligence Score:** 36 on the Artificial Analysis Intelligence Index, ahead of gpt-oss-120b (33) but behind Qwen3.5 122B A10B (42)
- **Openness Rating:** 83 on the Artificial Analysis Openness Index, with open weights, datasets, and training recipes
- **Throughput Champion:** 11% higher throughput per GPU than gpt-oss-120b, 40% higher than Qwen3.5 122B A10B

Compared to similar-sized models, Nemotron 3 Super leads gpt-oss-120b but trails the recently-released Qwen3.5 122B A10B in intelligence. Where it excels is speed. The critical question isn't just whether it's fast, but whether that speed comes at accuracy's expense.

đŸ› ïž Availability and Practical Use

Nemotron 3 Super is available now on build.nvidia.com, Perplexity, OpenRouter, and Hugging Face. Enterprises can also access it through Google Cloud's Vertex AI, Oracle Cloud Infrastructure, and soon Amazon Bedrock and Microsoft Azure. For those with the necessary hardware, NVIDIA makes it available as a NIM. The fact that it's currently free on OpenRouter means many will test it with their own workflows.

> Multi-agent systems generate up to 15x the tokens of standard chats, re-sending history, tool outputs, and reasoning steps at every turn.
>
> *NVIDIA Developer Blog*
The company reports the model is ideal for DGX Spark or NVIDIA RTX PRO workstations, thanks to 128GB unified memory supporting models with over 120 billion parameters.

Super + Nano Deployment Pattern

Nemotron 3 Nano is an excellent choice for high accuracy on targeted, individual steps within agentic workflows. But when multi-agent applications scale to complex, multi-step activities, they need high-capacity models for superior planning and reasoning. Think of a computer use agent that must make decisions between different tool modalities to create, say, a presentation with 10 high-quality slides. Nemotron 3 Super is ideal for this use case.
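The split can be expressed as a simple dispatch rule. The routing heuristic and task fields below are placeholders of my own, not an NVIDIA-published pattern: cheap targeted steps go to Nano, while multi-step, multi-tool planning goes to Super.

```python
# Hedged sketch of a Super + Nano deployment split. The model names
# follow the article; the routing heuristic and task schema are
# illustrative assumptions.
def pick_model(task):
    needs_planning = task["steps"] > 3 or len(task["tools"]) > 1
    return "nemotron-3-super" if needs_planning else "nemotron-3-nano"

slide_deck = {"steps": 12, "tools": ["browser", "python", "slides"]}
extraction = {"steps": 1, "tools": ["parser"]}
print(pick_model(slide_deck))  # nemotron-3-super
print(pick_model(extraction))  # nemotron-3-nano
```

In practice the routing signal would come from a planner or a classifier rather than a hand-written rule, but the economics are the same: reserve the 120B model for the turns that need it.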

🔼 The Future of Agentic AI

NVIDIA isn't stopping here. The same week, it introduced NemoClaw, an open-source stack that optimizes OpenClaw experiences on NVIDIA devices by hardening security and supporting local models. There's still no word on Nemotron 3 Ultra, the largest model in the family at 500 billion parameters that NVIDIA teased in last year's initial announcement; we might see it at GTC, running this week. What's becoming clear is that the race for efficient agentic AI has begun in earnest. The question is no longer whether we'll have AI agents running locally, but how quickly they'll become smart enough to take over real human workflows. And at 478 output tokens per second, Super takes a serious step in that direction.
