Models
NVIDIA Nemotron 3 Ultra (2026): 550B Agent Model, Benchmarks & Setup
Nemotron 3 Ultra (2026): 550B open agent model, 1M context, benchmarks, free OpenRouter routes, and who should self-host vs use the API.
NVIDIA shipped Nemotron 3 Ultra on June 4, 2026. It is a 550B-parameter open model with 55B active experts, built for multi-step agents that plan, call tools, spawn sub-agents, and keep context across hundreds of turns. This page explains what changed, what the benchmark numbers mean in plain English, and how to try it without renting a GPU cluster by accident.
For the wider June 2026 model map, see Latest AI Models Compared (2026). For free API routing, see OpenRouter Free Models (2026).
Last updated: June 5, 2026. Live on aitoolsradar.org.
Quick specs
| Spec | Nemotron 3 Ultra (NVIDIA-Nemotron-3-Ultra-550B-A55B) |
|---|---|
| Total parameters | 550B (55B active per token) |
| Architecture | LatentMoE hybrid: Mamba-2 + MoE + selective Attention + Multi-Token Prediction (MTP) |
| Context | Up to 1M tokens (many runtimes default to 256K until you raise limits) |
| Precision | NVFP4 production checkpoint; BF16 variant for research |
| License | OpenMDW-1.1 |
| Release | 2026-06-04 (build.nvidia.com + Hugging Face) |
| Languages | English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, Chinese |
| Min self-host GPUs | 4x B200 / GB200 (NVFP4) or 8x H100 per NVIDIA model card |
| Best for | Agent orchestration, long-document RAG, coding agents, safety guardrails |
| Watch out for | Hardware cost, latency on free tiers, thinking mode token overhead |

Why Nemotron 3 Ultra exists (in one paragraph)
Chatbots that answer once are giving way to agents that run for minutes or hours. Each step adds more tokens: plans, tool JSON, stderr logs, retrieved docs, and sub-agent replies. NVIDIA’s pitch is not only “smarter answers” but cheaper long jobs: Ultra is post-trained on tool-heavy RL environments, uses MTP to draft multiple tokens per step, and ships NVFP4 so one checkpoint runs on Hopper and Blackwell without maintaining separate weight files.
It is aimed at long agent workflows and ships a configurable reasoning mode in the chat template.
Nemotron 3 family: Ultra, Super, Nano, and the June 4 extras
Nemotron 3 is a family, not one download.
| Model | Scale (total / active) | Role |
|---|---|---|
| Nemotron 3 Ultra | 550B / 55B | Frontier orchestration, hard reasoning, 1M context |
| Nemotron 3 Super | 120B / 12B | Strong open model that is easier to host; common :free OpenRouter slug |
| Nemotron 3 Nano | Smaller edge variants | On-device and high-volume routing |
| Nemotron 3.5 Content Safety | 4B guardrail | Policy classification across text + image |
| Nemotron 3.5 ASR | 0.6B streaming | Multilingual speech for voice agents |
Nemotron 3 Super (120B) shipped in the same family and is easier to host. Treat Ultra as the planner for hard steps and Super as the workhorse for bulk tool calls unless a specific benchmark is your only decision factor.
Architecture without the jargon wall
LatentMoE (Mixture of Experts, compressed)
Classic MoE models route full-width vectors to experts. LatentMoE projects tokens into a smaller latent space before routing. NVIDIA claims better accuracy per byte moved across the GPU mesh. You still get 550B total capacity, but only 55B fire per token, which keeps inference feasible on a small number of Blackwell nodes.
Mamba-2 + Attention hybrid
Mamba-2 layers handle long sequences efficiently. Attention layers sit where precise recall matters (needle-in-haystack facts inside a 1M-token repo). The blend is why NVIDIA quotes 94.7% on RULER at 1M in BF16 mode: the model is tuned for “find paragraph 17,482” style tasks, not only chat polish.
Multi-Token Prediction (MTP)
MTP heads predict several future tokens per forward pass. Training uses a shared-weight design; inference enables speculative decoding (vLLM nemotron_h_mtp with five draft tokens in the official recipe). Plain English: agent loops that emit long tool arguments or code blocks should finish faster per wall-clock second when MTP is enabled.
NVFP4 everywhere it is safe
Weights, activations, and gradients use NVFP4 during pre-training where stable. Sensitive layers (embeddings, QKV, MTP) stay in BF16 or MXFP8. The result is one NVFP4 checkpoint that NVIDIA says runs up to 5x higher throughput than BF16 on Blackwell at similar interactivity. That is a vendor claim; always benchmark your own agent harness.
Thinking mode
Set enable_thinking=True in the chat template. The model emits a reasoning trace, then the user-facing answer. Agent frameworks must parse both streams (vLLM --reasoning-parser nemotron_v3). For production chat UIs that do not show chain-of-thought, turn thinking off to save tokens.
Benchmarks (vendor table, June 2026)
We did not rerun these suites. Numbers come from the build.nvidia.com model card and the NVIDIA technical blog. Use them to see strengths, not to crown a single winner.
| Area | Benchmark | Nemotron 3 Ultra (BF16) | Plain meaning |
|---|---|---|---|
| Coding | SWE-Bench Verified | 71.9% | Can it fix real GitHub issues end-to-end? |
| Coding | Terminal Bench 2.1 | 56.4% | Can it drive a shell like a human dev? |
| Agents | PinchBench | 90.0% | Multi-tool productivity-style tasks |
| Agents | Tau-Bench v3 (avg) | 70.9% | Customer-service style tool sims |
| Knowledge | GPQA (no tools) | 87.0% | Hard science multiple choice |
| Long context | RULER @ 1M | 94.7% | Retrieval across million-token windows |
| Long context | AA-LCR | 65.4% | Long-document aggregation |
| Instruction | IFBench | 81.7% | Following fiddly prompt constraints |
NVIDIA also publishes competitive tables against GLM 5.1, Kimi K2.6, and Qwen3.5 on agent productivity (PinchBench) and long-context RULER. Ultra leads or ties several agent scores while advertising lower cost per completed task on SWE-Bench style runs. Treat cross-vendor tables as launch marketing until independent labs reproduce them.
Who should use Ultra vs Super vs closed APIs
| You are… | Likely path |
|---|---|
| Indie dev testing ideas | OpenRouter nvidia/nemotron-3-ultra-550b-a55b:free or build.nvidia.com playground |
| Startup shipping agents | Ultra for planner steps, Super for bulk tool calls, closed model as fallback |
| Enterprise with Blackwell pods | Self-host NVFP4 with vLLM or TensorRT-LLM cookbooks |
| Regulated on-prem only | Download weights + OpenMDW license review with legal |
| Need best single-shot coding score | Benchmark your repo; Ultra is strong but not always #1 |
How to try it today (no cluster required)
1. NVIDIA build.nvidia.com (fastest UI)
Sign in, open the Nemotron 3 Ultra model page, and run prompts in the hosted playground. Trial terms apply via NVIDIA API Trial Terms of Service.

2. OpenRouter (API key, fits existing agents)
Route any OpenAI-compatible client to OpenRouter. Model slug: nvidia/nemotron-3-ultra-550b-a55b. Add :free while the promotion lasts (see our OpenRouter free models guide).

3. Hugging Face weights (self-host or fine-tune)
Primary production artifact: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4. BF16 base and post-training datasets live in NVIDIA’s Nemotron v3 collections for reproducibility.
4. Partner hosts (day-zero list)
NVIDIA’s launch blog names Perplexity Pro, Together AI, Fireworks, Baseten, Modal, CoreWeave, Amazon SageMaker JumpStart, Microsoft Foundry, and others. Pick the host that matches your compliance region and existing contract.
Self-hosting snapshot (experienced teams only)
If you manage bare metal, start from NVIDIA’s vLLM v0.22.0 container recipe:
- Single node: 4x B200, tensor parallel 4, expert parallel on, FP8 KV cache, MTP speculative config with five tokens.
- Context: Default recipes use 256K
max-model-len. SetVLLM_ALLOW_LONG_MAX_MODEL_LEN=1and1048576only if you truly need 1M and have the RAM. - Tool calls: Enable auto tool choice with
qwen3_coderparser per NVIDIA docs (same family as several Chinese open models). - Multi-node: Ray head + workers, pipeline parallel 2, distributed executor backend
ray.
SGLang and TensorRT-LLM cookbooks ship the same day with parallel flags. Unless you already run NeMo clusters, do not start here. Use OpenRouter or NIM first.
Training and openness (why enterprises care)
NVIDIA positions Nemotron as open weights + open data + open recipes:
- Pre-training: ~20T tokens, NVFP4 recipe, disclosed corpora on Hugging Face.
- SFT: Synthetic code, math, tool-calling, long-document aggregation sets.
- RL: Async GRPO across NeMo Gym environments (math, code, multi-turn tools).
- MOPD: Multi-Teacher On-Policy Distillation from 10+ domain teachers that score the student’s own rollouts.
OpenMDW-1.1 replaces older Nemotron license text so legal teams have one framework for weights, code, and docs. That matters if you fine-tune on private data and redistribute adapters.
Agent frameworks NVIDIA highlights
Ultra is trained for harness diversity, not a single IDE:
- Hermes Agent, OpenClaw, OpenHands, CrewAI, LangChain Deep Agents, Pi, Cline, Factory, OpenCode, and more list Nemotron in June 2026 docs.
- NemoClaw plus OpenShell bundle a safer runtime story for always-on agents (early preview in launch week).
A common production pattern: Ultra plans, a smaller model executes bash or SQL, Ultra verifies before merge. That mirrors how teams use Claude Opus with Haiku, but with open weights on the orchestrator.
Cost and latency: what “5x throughput” and “30% cheaper tasks” mean
NVIDIA cites 5x output speed vs other open frontier models on Artificial Analysis style charts (Blackwell endpoints). Separately, they claim ~30% lower cost to finish SWE-Bench Verified style agent jobs because Ultra uses fewer tokens per turn. Your mileage depends on:
- Whether thinking mode is on.
- How chatty your tool schema is.
- If MTP and FP8 KV cache are enabled in vLLM.
- Batch size and concurrent agents on the same GPU.
Run an A/B with your real Jira tickets, not a demo prompt.
Nemotron 3 Ultra vs Nemotron 3 Super
| Question | Ultra | Super |
|---|---|---|
| Parameter count | 550B / 55B active | 120B / 12B active |
| Context | Up to 1M | Check Super card (shorter in most hosts) |
| Hosting | Data-center Blackwell class | Feasible on fewer GPUs; popular :free route |
| Best fit | Orchestration, 1M RAG, hardest agent steps | Daily tool calls, cheaper API, local experiments |
If you only need quick classification or shallow tool calls, Super is the economic default. Ultra is for when the agent must hold an architecture decision across fifty prior steps.
Limitations and honest caveats
- Hardware wall: There is no realistic “run Ultra on a gaming GPU” path. Plan cloud APIs unless you operate Blackwell racks.
- Free tiers throttle: OpenRouter
:freemodels can queue or rate-limit. Do not benchmark SLA-sensitive prod flows on free routes alone. - Banking sim weakness: Tau-Bench banking scores in NVIDIA’s own table are low (~22%). Do not assume finance agent readiness without private evals.
- License still new: OpenMDW is clearer than many custom licenses, but your compliance team must still approve redistribution.
- Benchmark marketing: Launch-week tables cherry-pick friendly harnesses. Replicate on your stack.
Same-day siblings: safety and voice
June 4 also dropped Nemotron 3.5 Content Safety (4B, 23 categories, 12 languages) and Nemotron 3.5 ASR (40+ languages, sub-100 ms streaming). Ultra does not replace those. Use Safety as a guardrail model in front of Ultra, and ASR if you build voice-native agents.
Bottom line
Nemotron 3 Ultra is NVIDIA’s open bet that agent orchestration needs a frontier model you can host, fine-tune, and audit, with 1M context and NVFP4 speed on Blackwell. It is not the cheap daily driver. Pair it with Nemotron 3 Super, OpenRouter free routes, or your existing closed API for grunt work.
Try first: build.nvidia.com playground or OpenRouter free slug. Self-host when: you already run 4x B200 and need data residency. Skip when: you only need inline code completion in an IDE and no agent loop.
Changelog
- 2026-06-05: First publish. Benchmarks, API and OpenRouter access, and when to self-host vs use Nemotron 3 Super.
Frequently asked
8 questionsWhat is NVIDIA Nemotron 3 Ultra?
Nemotron 3 Ultra is NVIDIA's frontier open model released June 4, 2026. It has 550 billion total parameters with 55 billion active per forward pass (Mixture-of-Experts). It targets long-running agents, tool use, coding, and reasoning up to 1 million tokens of context. Weights, training data, and recipes ship under OpenMDW-1.1.
Is Nemotron 3 Ultra free?
You can try it free on build.nvidia.com (trial terms apply) and on OpenRouter with the nvidia/nemotron-3-ultra-550b-a55b:free slug while NVIDIA's promotion lasts. Self-hosting is not free. You need roughly four B200 GPUs or eight H100s for the NVFP4 checkpoint.
How does Nemotron 3 Ultra compare to Nemotron 3 Super?
Ultra is the 550B orchestration model for hard agent steps and long context. Super is the 120B sibling (12B active) that is cheaper and easier to run locally or on a single high-end node. Many teams route easy tool calls to Super and hard planning to Ultra.
What hardware do I need to run Nemotron 3 Ultra locally?
NVIDIA lists a minimum of 4x GB200, 4x B200, 4x GB300, 4x B300, or 8x H100 for the NVFP4 weights plus KV cache. Single-node vLLM recipes target 4x B200. Multi-node Ray setups are recommended beyond that.
Does Nemotron 3 Ultra support thinking mode?
Yes. The chat template exposes enable_thinking=True or False. With thinking on, the model writes a reasoning trace before the final answer. Agent hosts like vLLM use the nemotron_v3 reasoning parser. Turn thinking off for faster chat-style replies.
Where can I download Nemotron 3 Ultra weights?
Hugging Face hosts BF16 and NVFP4 checkpoints, including nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 for production inference. NVIDIA also publishes base, post-training datasets, and NeMo cookbooks on GitHub.
Is Nemotron 3 Ultra good for coding agents?
NVIDIA reports 71.9% on SWE-Bench Verified (BF16) and 56.4% on Terminal Bench 2.1. That is strong for an open weight model, though some rivals score higher on raw coding leaderboards. Ultra's pitch is fewer tokens per agent turn and better long-horizon planning, not only raw patch accuracy.
Nemotron 3 Ultra vs Claude or GPT for agents?
Closed models still win many one-shot coding benches. Ultra is aimed at teams that want open weights, on-prem deploy, or a mix of frontier orchestration plus cheaper worker models. Pair Ultra on hard steps with Nemotron 3 Super or another small model on bulk tool calls to control cost.
More in Models
View all
GLM-5.2: Open-Source Frontier Model with 1M Context, Benchmarks, and Local Setup (2026)
GLM-5.2 from Zhipu AI is a 744B open-weight model under MIT license. Benchmarks, pricing, local setup with vLLM and llama.cpp, and how it compares to Claude Opus 4.8 and GPT-5.5.
Models

Kimi K2.7 Code (2026): 1T MoE Coding Model, Benchmarks & Pricing
Kimi K2.7 Code: 1T open-source coding model from Moonshot AI, 32B active MoE, preserve_thinking mode, benchmarks vs GPT-5.5 and Claude Opus.
Models

MiniMax M3 Open Source (2026): 428B Model, 1M Context & Benchmarks
MiniMax M3: 428B open-weights model, 1M context via sparse attention, native multimodal input, competitive coding benchmarks, and 10x cheaper than GPT-5.5.
Models
More stories
View all
US Government Blocks Anthropic Fable 5 & Mythos 5 (2026)
US government ban on Anthropic: Commerce Dept ordered suspension of Fable 5 & Mythos 5 on June 12, 2026. Full timeline of the 4-month feud.
Models

Siri AI Review (2026): Apple's Rebuilt Assistant vs ChatGPT & Gemini [Tested]
Siri AI is Apple's rebuilt assistant for 2026. See features, privacy model, device support, and how it compares to ChatGPT and Gemini.
Review

Claude Fable 5 Release (2026): Anthropic's Most Powerful AI Model Explained
Claude Fable 5 is the first Mythos-class model available to the public. State-of-the-art coding, vision, and knowledge work with new safeguards. Pricing, benchmarks, and what it means.
Models

Ideogram AI Review (2026): Free Tier Tested, vs Midjourney & Recraft
Ideogram AI review (2026): we tested free tier, pricing, text rendering, and Ideogram 4.0 vs Midjourney and Recraft. Who should use it?
Review