AI Tools Radar
中文
NVIDIA Nemotron 3 Ultra featured image with connected AI model nodes on AI Tools Radar

Models

NVIDIA Nemotron 3 Ultra (2026): 550B Agent Model, Benchmarks & Setup

Nemotron 3 Ultra (2026): 550B open agent model, 1M context, benchmarks, free OpenRouter routes, and who should self-host vs use the API.

AI Tools Radar Editorial 9 min read

NVIDIA shipped Nemotron 3 Ultra on June 4, 2026. It is a 550B-parameter open model with 55B active experts, built for multi-step agents that plan, call tools, spawn sub-agents, and keep context across hundreds of turns. This page explains what changed, what the benchmark numbers mean in plain English, and how to try it without renting a GPU cluster by accident.

For the wider June 2026 model map, see Latest AI Models Compared (2026). For free API routing, see OpenRouter Free Models (2026).

Last updated: June 5, 2026. Live on aitoolsradar.org.

Quick specs

SpecNemotron 3 Ultra (NVIDIA-Nemotron-3-Ultra-550B-A55B)
Total parameters550B (55B active per token)
ArchitectureLatentMoE hybrid: Mamba-2 + MoE + selective Attention + Multi-Token Prediction (MTP)
ContextUp to 1M tokens (many runtimes default to 256K until you raise limits)
PrecisionNVFP4 production checkpoint; BF16 variant for research
LicenseOpenMDW-1.1
Release2026-06-04 (build.nvidia.com + Hugging Face)
LanguagesEnglish, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, Chinese
Min self-host GPUs4x B200 / GB200 (NVFP4) or 8x H100 per NVIDIA model card
Best forAgent orchestration, long-document RAG, coding agents, safety guardrails
Watch out forHardware cost, latency on free tiers, thinking mode token overhead

NVIDIA build.nvidia.com model card for Nemotron 3 Ultra 550B showing specs and deploy options

NVIDIA NIM model card on build.nvidia.com for Nemotron 3 Ultra. Screenshot from build.nvidia.com, captured June 5, 2026. UI may change.

Why Nemotron 3 Ultra exists (in one paragraph)

Chatbots that answer once are giving way to agents that run for minutes or hours. Each step adds more tokens: plans, tool JSON, stderr logs, retrieved docs, and sub-agent replies. NVIDIA’s pitch is not only “smarter answers” but cheaper long jobs: Ultra is post-trained on tool-heavy RL environments, uses MTP to draft multiple tokens per step, and ships NVFP4 so one checkpoint runs on Hopper and Blackwell without maintaining separate weight files.

It is aimed at long agent workflows and ships a configurable reasoning mode in the chat template.

Nemotron 3 family: Ultra, Super, Nano, and the June 4 extras

Nemotron 3 is a family, not one download.

ModelScale (total / active)Role
Nemotron 3 Ultra550B / 55BFrontier orchestration, hard reasoning, 1M context
Nemotron 3 Super120B / 12BStrong open model that is easier to host; common :free OpenRouter slug
Nemotron 3 NanoSmaller edge variantsOn-device and high-volume routing
Nemotron 3.5 Content Safety4B guardrailPolicy classification across text + image
Nemotron 3.5 ASR0.6B streamingMultilingual speech for voice agents

Nemotron 3 Super (120B) shipped in the same family and is easier to host. Treat Ultra as the planner for hard steps and Super as the workhorse for bulk tool calls unless a specific benchmark is your only decision factor.

Architecture without the jargon wall

LatentMoE (Mixture of Experts, compressed)

Classic MoE models route full-width vectors to experts. LatentMoE projects tokens into a smaller latent space before routing. NVIDIA claims better accuracy per byte moved across the GPU mesh. You still get 550B total capacity, but only 55B fire per token, which keeps inference feasible on a small number of Blackwell nodes.

Mamba-2 + Attention hybrid

Mamba-2 layers handle long sequences efficiently. Attention layers sit where precise recall matters (needle-in-haystack facts inside a 1M-token repo). The blend is why NVIDIA quotes 94.7% on RULER at 1M in BF16 mode: the model is tuned for “find paragraph 17,482” style tasks, not only chat polish.

Multi-Token Prediction (MTP)

MTP heads predict several future tokens per forward pass. Training uses a shared-weight design; inference enables speculative decoding (vLLM nemotron_h_mtp with five draft tokens in the official recipe). Plain English: agent loops that emit long tool arguments or code blocks should finish faster per wall-clock second when MTP is enabled.

NVFP4 everywhere it is safe

Weights, activations, and gradients use NVFP4 during pre-training where stable. Sensitive layers (embeddings, QKV, MTP) stay in BF16 or MXFP8. The result is one NVFP4 checkpoint that NVIDIA says runs up to 5x higher throughput than BF16 on Blackwell at similar interactivity. That is a vendor claim; always benchmark your own agent harness.

Thinking mode

Set enable_thinking=True in the chat template. The model emits a reasoning trace, then the user-facing answer. Agent frameworks must parse both streams (vLLM --reasoning-parser nemotron_v3). For production chat UIs that do not show chain-of-thought, turn thinking off to save tokens.

Benchmarks (vendor table, June 2026)

We did not rerun these suites. Numbers come from the build.nvidia.com model card and the NVIDIA technical blog. Use them to see strengths, not to crown a single winner.

AreaBenchmarkNemotron 3 Ultra (BF16)Plain meaning
CodingSWE-Bench Verified71.9%Can it fix real GitHub issues end-to-end?
CodingTerminal Bench 2.156.4%Can it drive a shell like a human dev?
AgentsPinchBench90.0%Multi-tool productivity-style tasks
AgentsTau-Bench v3 (avg)70.9%Customer-service style tool sims
KnowledgeGPQA (no tools)87.0%Hard science multiple choice
Long contextRULER @ 1M94.7%Retrieval across million-token windows
Long contextAA-LCR65.4%Long-document aggregation
InstructionIFBench81.7%Following fiddly prompt constraints

NVIDIA also publishes competitive tables against GLM 5.1, Kimi K2.6, and Qwen3.5 on agent productivity (PinchBench) and long-context RULER. Ultra leads or ties several agent scores while advertising lower cost per completed task on SWE-Bench style runs. Treat cross-vendor tables as launch marketing until independent labs reproduce them.

Who should use Ultra vs Super vs closed APIs

You are…Likely path
Indie dev testing ideasOpenRouter nvidia/nemotron-3-ultra-550b-a55b:free or build.nvidia.com playground
Startup shipping agentsUltra for planner steps, Super for bulk tool calls, closed model as fallback
Enterprise with Blackwell podsSelf-host NVFP4 with vLLM or TensorRT-LLM cookbooks
Regulated on-prem onlyDownload weights + OpenMDW license review with legal
Need best single-shot coding scoreBenchmark your repo; Ultra is strong but not always #1

How to try it today (no cluster required)

1. NVIDIA build.nvidia.com (fastest UI)

Sign in, open the Nemotron 3 Ultra model page, and run prompts in the hosted playground. Trial terms apply via NVIDIA API Trial Terms of Service.

Hugging Face model page for NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 with download stats and tags

Hugging Face repository for the NVFP4 Ultra checkpoint. Screenshot from huggingface.co, captured June 5, 2026. Download counts change daily.

2. OpenRouter (API key, fits existing agents)

Route any OpenAI-compatible client to OpenRouter. Model slug: nvidia/nemotron-3-ultra-550b-a55b. Add :free while the promotion lasts (see our OpenRouter free models guide).

OpenRouter catalog page for nvidia/nemotron-3-ultra-550b-a55b with pricing and context limits

OpenRouter listing for Nemotron 3 Ultra. Screenshot from openrouter.ai, captured June 5, 2026. Pricing and context limits may change.

3. Hugging Face weights (self-host or fine-tune)

Primary production artifact: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4. BF16 base and post-training datasets live in NVIDIA’s Nemotron v3 collections for reproducibility.

4. Partner hosts (day-zero list)

NVIDIA’s launch blog names Perplexity Pro, Together AI, Fireworks, Baseten, Modal, CoreWeave, Amazon SageMaker JumpStart, Microsoft Foundry, and others. Pick the host that matches your compliance region and existing contract.

Self-hosting snapshot (experienced teams only)

If you manage bare metal, start from NVIDIA’s vLLM v0.22.0 container recipe:

  • Single node: 4x B200, tensor parallel 4, expert parallel on, FP8 KV cache, MTP speculative config with five tokens.
  • Context: Default recipes use 256K max-model-len. Set VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and 1048576 only if you truly need 1M and have the RAM.
  • Tool calls: Enable auto tool choice with qwen3_coder parser per NVIDIA docs (same family as several Chinese open models).
  • Multi-node: Ray head + workers, pipeline parallel 2, distributed executor backend ray.

SGLang and TensorRT-LLM cookbooks ship the same day with parallel flags. Unless you already run NeMo clusters, do not start here. Use OpenRouter or NIM first.

Training and openness (why enterprises care)

NVIDIA positions Nemotron as open weights + open data + open recipes:

  • Pre-training: ~20T tokens, NVFP4 recipe, disclosed corpora on Hugging Face.
  • SFT: Synthetic code, math, tool-calling, long-document aggregation sets.
  • RL: Async GRPO across NeMo Gym environments (math, code, multi-turn tools).
  • MOPD: Multi-Teacher On-Policy Distillation from 10+ domain teachers that score the student’s own rollouts.

OpenMDW-1.1 replaces older Nemotron license text so legal teams have one framework for weights, code, and docs. That matters if you fine-tune on private data and redistribute adapters.

Agent frameworks NVIDIA highlights

Ultra is trained for harness diversity, not a single IDE:

  • Hermes Agent, OpenClaw, OpenHands, CrewAI, LangChain Deep Agents, Pi, Cline, Factory, OpenCode, and more list Nemotron in June 2026 docs.
  • NemoClaw plus OpenShell bundle a safer runtime story for always-on agents (early preview in launch week).

A common production pattern: Ultra plans, a smaller model executes bash or SQL, Ultra verifies before merge. That mirrors how teams use Claude Opus with Haiku, but with open weights on the orchestrator.

Cost and latency: what “5x throughput” and “30% cheaper tasks” mean

NVIDIA cites 5x output speed vs other open frontier models on Artificial Analysis style charts (Blackwell endpoints). Separately, they claim ~30% lower cost to finish SWE-Bench Verified style agent jobs because Ultra uses fewer tokens per turn. Your mileage depends on:

  • Whether thinking mode is on.
  • How chatty your tool schema is.
  • If MTP and FP8 KV cache are enabled in vLLM.
  • Batch size and concurrent agents on the same GPU.

Run an A/B with your real Jira tickets, not a demo prompt.

Nemotron 3 Ultra vs Nemotron 3 Super

QuestionUltraSuper
Parameter count550B / 55B active120B / 12B active
ContextUp to 1MCheck Super card (shorter in most hosts)
HostingData-center Blackwell classFeasible on fewer GPUs; popular :free route
Best fitOrchestration, 1M RAG, hardest agent stepsDaily tool calls, cheaper API, local experiments

If you only need quick classification or shallow tool calls, Super is the economic default. Ultra is for when the agent must hold an architecture decision across fifty prior steps.

Limitations and honest caveats

  • Hardware wall: There is no realistic “run Ultra on a gaming GPU” path. Plan cloud APIs unless you operate Blackwell racks.
  • Free tiers throttle: OpenRouter :free models can queue or rate-limit. Do not benchmark SLA-sensitive prod flows on free routes alone.
  • Banking sim weakness: Tau-Bench banking scores in NVIDIA’s own table are low (~22%). Do not assume finance agent readiness without private evals.
  • License still new: OpenMDW is clearer than many custom licenses, but your compliance team must still approve redistribution.
  • Benchmark marketing: Launch-week tables cherry-pick friendly harnesses. Replicate on your stack.

Same-day siblings: safety and voice

June 4 also dropped Nemotron 3.5 Content Safety (4B, 23 categories, 12 languages) and Nemotron 3.5 ASR (40+ languages, sub-100 ms streaming). Ultra does not replace those. Use Safety as a guardrail model in front of Ultra, and ASR if you build voice-native agents.

Bottom line

Nemotron 3 Ultra is NVIDIA’s open bet that agent orchestration needs a frontier model you can host, fine-tune, and audit, with 1M context and NVFP4 speed on Blackwell. It is not the cheap daily driver. Pair it with Nemotron 3 Super, OpenRouter free routes, or your existing closed API for grunt work.

Try first: build.nvidia.com playground or OpenRouter free slug. Self-host when: you already run 4x B200 and need data residency. Skip when: you only need inline code completion in an IDE and no agent loop.


Changelog

  • 2026-06-05: First publish. Benchmarks, API and OpenRouter access, and when to self-host vs use Nemotron 3 Super.

Frequently asked

8 questions
What is NVIDIA Nemotron 3 Ultra?

Nemotron 3 Ultra is NVIDIA's frontier open model released June 4, 2026. It has 550 billion total parameters with 55 billion active per forward pass (Mixture-of-Experts). It targets long-running agents, tool use, coding, and reasoning up to 1 million tokens of context. Weights, training data, and recipes ship under OpenMDW-1.1.

Is Nemotron 3 Ultra free?

You can try it free on build.nvidia.com (trial terms apply) and on OpenRouter with the nvidia/nemotron-3-ultra-550b-a55b:free slug while NVIDIA's promotion lasts. Self-hosting is not free. You need roughly four B200 GPUs or eight H100s for the NVFP4 checkpoint.

How does Nemotron 3 Ultra compare to Nemotron 3 Super?

Ultra is the 550B orchestration model for hard agent steps and long context. Super is the 120B sibling (12B active) that is cheaper and easier to run locally or on a single high-end node. Many teams route easy tool calls to Super and hard planning to Ultra.

What hardware do I need to run Nemotron 3 Ultra locally?

NVIDIA lists a minimum of 4x GB200, 4x B200, 4x GB300, 4x B300, or 8x H100 for the NVFP4 weights plus KV cache. Single-node vLLM recipes target 4x B200. Multi-node Ray setups are recommended beyond that.

Does Nemotron 3 Ultra support thinking mode?

Yes. The chat template exposes enable_thinking=True or False. With thinking on, the model writes a reasoning trace before the final answer. Agent hosts like vLLM use the nemotron_v3 reasoning parser. Turn thinking off for faster chat-style replies.

Where can I download Nemotron 3 Ultra weights?

Hugging Face hosts BF16 and NVFP4 checkpoints, including nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 for production inference. NVIDIA also publishes base, post-training datasets, and NeMo cookbooks on GitHub.

Is Nemotron 3 Ultra good for coding agents?

NVIDIA reports 71.9% on SWE-Bench Verified (BF16) and 56.4% on Terminal Bench 2.1. That is strong for an open weight model, though some rivals score higher on raw coding leaderboards. Ultra's pitch is fewer tokens per agent turn and better long-horizon planning, not only raw patch accuracy.

Nemotron 3 Ultra vs Claude or GPT for agents?

Closed models still win many one-shot coding benches. Ultra is aimed at teams that want open weights, on-prem deploy, or a mix of frontier orchestration plus cheaper worker models. Pair Ultra on hard steps with Nemotron 3 Super or another small model on bulk tool calls to control cost.

More in Models

View all