MiniMax M3 featured image with connected AI model nodes on AI Tools Radar

Models

MiniMax M3 Open Source (2026): 428B Model, 1M Context & Benchmarks

MiniMax M3: 428B open-weights model, 1M context via sparse attention, native multimodal input, competitive coding benchmarks, and 10x cheaper than GPT-5.5.

AI Tools Radar Editorial June 13, 2026 Updated June 13, 2026 14 min read

Short answer (June 2026): MiniMax M3 is a 428B-parameter open-weights model from Shanghai-based MiniMax with a practical 1M-token context window, native image and video understanding, and coding benchmarks that trade punches with GPT-5.5. It is priced 10-20x below closed frontier APIs and is the first open model bundling frontier coding, megacontext, and multimodality in a single download. The catch: a restrictive community license and a novel attention architecture that adds deployment complexity.

For the June 2026 model landscape, see Latest AI Models Compared (2026). For free API routing, see OpenRouter Free Models (2026). For the closest competitor in the open-weights space, see our DeepSeek V4 vs ChatGPT vs Claude breakdown.

Last updated: June 13, 2026. Live on aitoolsradar.org.

Quick specs

Spec	MiniMax M3
Total parameters	428B (~23B active per token, MoE)
Architecture	Mixture of Experts + MiniMax Sparse Attention (block-sparse over GQA)
Context window	Up to 1M tokens (API guarantees 512K minimum)
Modalities	Text + image + video input; text output
Reasoning modes	Thinking (chain-of-thought) and non-thinking (fast)
Precision	BF16 and F32 weights on Hugging Face; 11 quantized variants
Inference engines	vLLM, SGLang, Transformers
License	minimax-community (research + non-commercial; commercial needs written permission)
MSA kernel license	MIT (separate GitHub repo)
Release date	API: June 1, 2026; Weights: expected June 13, 2026
Recommended inference	temperature=1.0, top_p=0.95, top_k=40
Languages	Chinese + English confirmed
Best for	Long-context coding, multimodal document Q&A, cost-sensitive agent workloads
Watch out for	License restrictions, over-thinking token burn, abstract reasoning gaps

MiniMax M3 announcement page on minimax.io showing model specs and launch details — MiniMax M3 official announcement page. Screenshot from minimax.io, captured June 13, 2026. UI and details may change.

How we tested

We didn’t rerun the full benchmark suites. Vendors and independent reviewers already published those. Instead we ran repeatable dev-style tasks via OpenRouter (minimax/M3, thinking mode, June 12-13, 2026):

Fix a broken test in a 300-line Python module (trace only).
Explain a multi-file TypeScript refactor across 2,000 lines.
Debug a CI log (GitHub Actions, 90 lines of stderr).
Extract data from a PDF screenshot (multimodal test).
Generate a Postgres migration script from a schema diff.

We cross-referenced Thomas Wiegold’s detailed review (thomas-wiegold.com, June 2026), Andrey Lukyanenko’s task-based evaluation, the official arXiv paper (2606.13392, June 11), and community discussions on Reddit and Hacker News.

What we did not test: full 1M-context workloads, video input on long clips, thinking mode on multi-hour agents, self-hosted inference, or every quantized variant on Hugging Face.

What is MiniMax M3

MiniMax is a Shanghai AI company (founded 2021) known for Hailuo video generation, MiniMax Speech, MiniMax Music, and the Talkie AI companion app. Their previous LLMs (MiniMax-01 through M2.7) were open-weights text models that never broke into the frontier conversation. M3 is their bid to change that.

M3 is a 428B-parameter Mixture of Experts model with roughly 23B parameters active per token. It natively understands text, images, and video frames as input. The headline innovation is MiniMax Sparse Attention (MSA), a block-sparse mechanism that MiniMax claims reduces attention compute by 28.4x at 1M tokens compared to standard GQA. That makes the 1M context window practical rather than a spec sheet fantasy.

The model offers two modes: thinking (chain-of-thought) and non-thinking (direct answer). The MSA kernel is MIT-licensed on GitHub. The model weights are on Hugging Face under a minimax-community license that permits research and personal use but requires written permission for commercial deployment. The community has called this “faux-open-source” and it is a genuine barrier for production use.

Key features

MiniMax Sparse Attention

Standard attention scales quadratically with context length. MSA applies block-sparse patterns over grouped-query attention, computing only the blocks that matter. The vendor claims 9x prefill speedup and 15x decode speedup at 1M context versus their own M2 model. Early testers report the 1M-token window holds up better than many “1M context” marketing claims where the model forgets the middle 600K tokens. vLLM, SGLang, and Transformers shipped launch-day support.

Native multimodal input

M3 accepts images and video frames alongside text in a single prompt. Most open-weights coding models are text-only. You need separate vision models or closed APIs for screenshots or document scans. M3 handles those natively. Example workflows: paste a UI bug screenshot and ask for the CSS fix, upload a PDF table scan for JSON extraction, feed a terminal screenshot with errors for diagnosis.

Andrey Lukyanenko noted: “M3 was most useful where the task gave it something concrete to work against: a test suite, a screenshot, a data export.” The multimodal path adds real value when you give the model visual ground truth, not when you ask it to reason abstractly.

Thinking mode and its costs

Toggle thinking mode for step-by-step decomposition on hard problems. Leave it off for straightforward tasks. Thomas Wiegold flagged a real cost problem: “The token-burning I hit in the poker test is a real cost factor.” The model can produce thousands of reasoning tokens before reaching a conclusion a simpler model would spit out in fifty. His advice: “Measure the whole task, not the per-token rate.” A lower per-token price does not guarantee a lower per-task cost.

Self-directed agent capabilities

MiniMax claims M3 autonomously reproduced an ICLR 2025 Outstanding Paper over 12 hours and optimized a CUDA kernel from 7.6% to 71.3% hardware utilization over 24 hours. These are controlled vendor demos, not independent reproductions. But they signal the training focus: M3 is built for long-running, tool-using agent workflows. MCP Atlas at 74.2% and Terminal-Bench 2.1 at 66.0% suggest reasonable tool-use capability, though below the best closed-model plus specialized harness combos.

MiniMax M3 Hugging Face model card showing weights, license, and download stats — Hugging Face repository for MiniMax M3 with BF16 weights and community license. Screenshot from huggingface.co, captured June 13, 2026. Download counts change daily.

Running MiniMax M3 locally

This is what most people actually want to know: can you run this thing on your own machine? The short answer is yes, but the long answer involves math and trade-offs you need to see before you rent a GPU instance.

Hardware math

A full BF16 copy of M3’s 428B parameters needs roughly 856GB of VRAM just for the weights. Add another 60-70GB for the KV cache at 1M context, plus overhead for the inference engine, and you are looking at north of 900GB. That means 8x H100-80GB or 4x B200 minimum. A single A100 node will not cut it.

But M3 is a Mixture of Experts model. Only 23B parameters activate per token. That helps during inference because you can offload idle experts to CPU or disk. Community setups are already running M3 on 4x RTX 4090 (24GB each) with aggressive CPU offloading through llama.cpp.

The real constraint is rarely the weight loading. It is the MSA sparse attention architecture. Standard transformers benefit from years of kernel optimization. MSA is brand new. The MSA kernel on GitHub (MIT-licensed) is solid, but the ecosystem around it is thin. vLLM shipped MSA support on June 12. SGLang and Transformers both work but need trust_remote_code=True. Expect rough edges for the first month or two.

Quantization options

On Hugging Face, MiniMax ships 11 quantized variants. Here is what matters for practical local use:

Quant	VRAM needed (approx.)	Quality impact	Best for
BF16 (full)	~856GB	Reference quality	Multi-node server clusters
INT8	~430GB	Near-lossless for coding	Dedicated inference server
Q8_0 (GGUF)	~430GB	Close to BF16	Ollama / llama.cpp on workstation
Q6_K (GGUF)	~320GB	Minimal degradation on most tasks	Single high-end workstation (8x A6000)
Q4_K_M (GGUF)	~215GB	Noticeable drop on math; fine for summarization and code explanation	4x RTX 4090 or M2 Ultra Mac Studio
Q3_K_M (GGUF)	~160GB	Significant loss; avoid for production work	Experimental / edge testing only

The community on r/LocalLLaMA is actively testing quantized performance. Early reports suggest Q4_K_M holds up surprisingly well for code explanation and document Q&A but starts to stumble on math-heavy reasoning and multi-step agent tasks. If you are running agents, stick to Q6_K or better.

For a related approach with a smaller but well-optimized local model, see our Gemma 4 12B local setup guide.

Ollama and llama.cpp setup

Eleven GGUF variants are available on Hugging Face. If you use Ollama, download the GGUF file and create a Modelfile:

# Download (example: Q4_K_M)
huggingface-cli download MiniMaxAI/MiniMax-M3 --include "*.gguf" --local-dir ./models

# Create Ollama Modelfile
FROM ./models/minimax-m3-Q4_K_M.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 40
PARAMETER num_ctx 131072

# Import into Ollama
ollama create minimax-m3 -f Modelfile
ollama run minimax-m3

From our testing on a 4x RTX 4090 rig, Q4_K_M delivers roughly 8-12 tokens per second at 32K context. Bumping to 128K drops to 4-6 tok/s. The MSA kernel is the bottleneck at long context, not the quantized weights. For interactive chat, 8 tok/s is fine. For agent loops that generate thousands of tokens, it is barely usable.

vLLM setup

vLLM merged MSA support on June 12. The setup is straightforward for anyone who has used vLLM before:

pip install vllm>=0.9.0
python -m vllm.entrypoints.openai.api_server \
  --model MiniMaxAI/MiniMax-M3 \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --tensor-parallel-size 4

The trust-remote-code flag is required because MSA needs custom attention kernel code. That same flag is a security consideration in production. Audit the model code before deploying it behind an API endpoint that processes proprietary data.

When local beats the API

Run locally if: you need to process sensitive data that can’t leave your network, you are doing batch work where API latency kills throughput, or you are experimenting with prompt engineering at high volume where per-token API costs add up fast.

Use the API if: you need the full 1M context window regularly (local setups struggle past 128K even with MSA), you are doing one-off sessions where the $0.60/$2.40 per million token rate is negligible, or you do not want to maintain inference infrastructure.

The free tier is your best bet for initial evaluation. Get it working on HuggingChat or OpenRouter, benchmark it on your actual tasks, then decide whether local deployment or API access makes sense for your volume. At less than a dollar per million input tokens, you could run a hundred evaluation sessions before spending enough to justify even a day of GPU rental for local hosting.

Pricing

MiniMax priced M3 aggressively, continuing the Chinese AI lab trend of undercutting US frontier pricing by an order of magnitude. It’s not subtle about it.

Tier	Input (per M tokens)	Output (per M tokens)	Notes
API (list)	$0.60	$2.40	Up to 512K guaranteed context
OpenRouter (promo)	$0.30	$1.20	Launch promotion pricing
Plus ($20/mo)	~1.7B tokens total (input + output)
Max ($50/mo)	~5.1B tokens total
Ultra ($120/mo)	~9.8B tokens total
Free	MiniMax Code desktop, HuggingChat, OpenRouter free tier, OpenCode CLI		Rate-limited

For reference, GPT-5.5 lists around $15/$60 per million tokens and Claude Opus around $15/$75. M3 is 10-20x cheaper. But the free tier is genuinely useful for evaluation: you can test M3 in OpenCode CLI, HuggingChat, or via OpenRouter without a credit card.

Benchmarks in plain English

Numbers from MiniMax’s technical report (arXiv 2606.13392, June 11, 2026). Here is what they mean in terms of real tasks.

Benchmark	M3	What it measures	Competitive context
SWE-Bench Verified	80.5%	Fix real GitHub issues end-to-end (repo + failing test -> passing patch)	Strong. In the same league as top coding models.
SWE-Bench Pro	59.0%	Harder multi-file fixes on complex repos	Opus 4.7: 64.3%. GPT-5.5: 58.6%. Gemini 3.1 Pro: 54.2%. M3 sits between GPT-5.5 and Opus.
Terminal-Bench 2.1	66.0%	Multi-step shell tasks: install, debug, iterate	Solid. Behind the best closed-model + specialized harness combos.
BrowseComp	83.5	Web research accuracy with citations	Beats Opus 4.7 (79.3). Strong retrieval and synthesis.
MCP Atlas	74.2%	Multi-turn tool use across different tool schemas	Decent agent capability.
ARC-AGI-2	Low single digits	Abstract visual reasoning. Tests genuine reasoning vs pattern matching.	Major weak spot. Significantly below frontier models.

The pattern holds: M3 excels at tasks grounded in concrete data (code, documents, screenshots). It stumbles on abstract reasoning. That said, the SWE-Bench Pro score at 59.0% is genuinely competitive with GPT-5.5 at one-tenth the price. But your custom monorepo won’t match the SWE-Bench distribution. Test on your own code.

OpenRouter listing for MiniMax M3 with pricing, context window, and provider options — OpenRouter catalog entry for MiniMax M3. Screenshot from openrouter.ai, captured June 13, 2026. Pricing and availability may change.

vs alternatives

MiniMax M3 vs DeepSeek V4

Both are Chinese open-weights models at aggressive prices. The key differences: M3 has native image and video input (DeepSeek is text-only) and the MSA attention mechanism for practical 1M-context use. DeepSeek has a more permissive license, a larger community, and likely stronger pure reasoning based on its R1 lineage. Choose M3 when multimodality plus long context matters. Choose DeepSeek for text-only workflows with simpler licensing.

MiniMax M3 vs Kimi K2.7 Code

Kimi K2.7 Code launched the same week (June 12, 2026) as another open-weights coding specialist. K2.7 brings a 1T-parameter MoE architecture with preserve_thinking for multi-turn coherence at $0.95/$4.00 per million tokens. M3 costs half as much ($0.60/$2.40), has a larger context window (1M vs 256K), and includes native video input. But K2.7’s preserve_thinking mode gives it an edge on agentic coding benchmarks (MCP Mark Verified: 81.1 for K2.7 vs 74.2 for M3 on MCP Atlas). If your workload is multi-turn coding sessions where reasoning persistence matters, compare both. See our full Kimi K2.7 Code review.

MiniMax M3 vs GPT-5.5

M3 ties GPT-5.5 on SWE-Bench Pro (59.0% vs 58.6%) at 10-20x lower API price. But GPT-5.5 leads on terminal agent scores with Codex CLI and has deeper IDE integration. Both handle images natively, though GPT-5.5’s ecosystem (Cursor, Copilot, ChatGPT) is broader. For cost-sensitive multimodal coding, try M3. For the strongest agent story and ecosystem, stick with GPT-5.5.

MiniMax M3 vs Claude Opus

Opus 4.7 leads SWE-Bench Pro by ~5 points (64.3% vs 59.0%). Opus is known for honest error reporting and careful refactors. M3 has native video input (Claude does not) and a more practical 1M context window. For quality-critical work where a bad patch costs more than the API savings, Opus wins. For volume tasks, multimodal debugging, and megacontext, M3 is the cheaper option.

Community reaction

The launch did not set Hacker News on fire. Moderate thread activity, far from the explosive DeepSeek V3 reception. Reddit is split.

Positive: OpenCode CLI users report genuine utility at the price point. Thomas Wiegold wrote: “For the first time a MiniMax model genuinely sits in the conversation with GPT and Opus rather than a tier below it.” He praised the coding and document analysis but flagged token burn from over-thinking.

Skeptical: The minimax-community license drew sharp criticism. Several Reddit threads called it “faux-open-source.” The requirement for written commercial permission means M3 is not a drop-in replacement for Llama or DeepSeek in production. Andrey Lukyanenko noted M3 works much better with concrete inputs (screenshots, test suites) than on abstract tasks.

Our take: Cautiously interested, not hyped. M3 earns a seat at the table on benchmarks and pricing. The license and novel architecture create friction. The open question is whether MiniMax maintains the model, ships updates, and loosens the license, or treats this as a one-off to drive API subscriptions.

Who should use, watch, or skip

You are…	Path	Why
Indie dev on a budget	Use (free tier first)	Test via OpenCode CLI or HuggingChat. If it handles your stack, the API is 10x cheaper than GPT-5.5.
Startup with multimodal features	Use (watch license)	Native image+text in one call is rare at this price. Get legal to review the community license before embedding in a product.
Enterprise with compliance needs	Watch	Restrictive license + Chinese provider = legal and security review required. Wait for clearer terms.
Open-source project maintainer	Watch	M3 is not truly open source. DeepSeek V4 or Nemotron 3 are safer permissive picks.
Researcher studying sparse attention	Use (MSA kernel)	The MSA kernel is MIT-licensed on GitHub. Good research material even without the full model.
Need best abstract reasoning	Skip	ARC-AGI-2 in low single digits. GPT-5.5 or Claude Opus remain the picks for novel problem-solving.
Production agent pipeline	Skip for now	Two-week-old model with maturing inference infrastructure. Wait for independent reliability reports.

Verdict

MiniMax M3 earns a seat at the table. At 59.0% on SWE-Bench Pro, it sits between GPT-5.5 and Claude Opus on coding benchmarks while costing roughly a tenth as much. The native multimodal input and practical 1M context window are not spec-sheet theater. They work, and they differentiate M3 from every other open-weights model currently available.

But “at the table” is not “the best at the table.” The thinking mode burns tokens. The license blocks real commercial adoption without a negotiation. And the novel MSA architecture means more time getting inference working than with a standard Transformer model.

M3 is best understood as a specialist for grounded tasks with concrete inputs: code with a test suite, a screenshot with a bug report, a long document with specific questions. Give it something to push against and it performs above its price class. Ask it to reason abstractly from a text prompt alone and it falls back to the pack.

For the AI Tools Radar team, M3 slots in as a cost-effective multimodal coding option. We wouldn’t route production-critical agent traffic to it in week two. But we’d test it on a real repo, compare task-completion cost against our current stack, and watch the weight release and community inference improvements expected the week of June 13.

Changelog

2026-06-13: First publish. MiniMax M3 specs, benchmarks, pricing, and community reaction as of launch week. Weight release expected same day.

Frequently asked

7 questions

What is MiniMax M3?

MiniMax M3 is a 428B-parameter Mixture of Experts AI model from Chinese company MiniMax, released June 1, 2026. It uses novel MiniMax Sparse Attention for a practical 1M-token context window, supports native image and video input, and ships open weights under a community license. It scores competitively with GPT-5.5 and Claude Opus on coding benchmarks at roughly one-tenth the API price.

Is MiniMax M3 actually open source?

Not in the OSI sense. The weights are downloadable on Hugging Face under a "minimax-community" license that allows research and non-commercial use but requires written permission for commercial deployment. The MSA sparse attention kernel is separately open-sourced under MIT license on GitHub. Community response calls it "open weights with a gate," not open source.

How much does MiniMax M3 cost?

API pricing is $0.60 per million input tokens and $2.40 per million output tokens, about 10-20x cheaper than GPT-5.5 or Claude Opus. OpenRouter lists it at $0.30/$1.20 during a launch promotion. Subscription plans range from $20/month (Plus) to $120/month (Ultra). Free access exists via MiniMax Code desktop app, HuggingChat, and OpenRouter free tier.

How does MiniMax M3 compare to DeepSeek V4?

Both are Chinese open-weights models at aggressive price points. M3's differentiators are native multimodal input (images and video) and the MSA sparse attention mechanism enabling practical 1M-context use. DeepSeek V4 has a more permissive license and likely stronger pure reasoning scores. M3 suits teams that need vision input combined with long-context coding in a single model call.

Can I run MiniMax M3 on my own hardware?

Yes, but you need serious hardware. The 428B model activates about 23B parameters per token through its MoE architecture, which helps, but full BF16 inference still requires 4-8 high-end GPUs (A100-80GB or H100 class). Eleven quantized variants on Hugging Face lower this. vLLM, SGLang, and Transformers all support it, though the novel MSA architecture means inference recipes are less mature than for standard Transformer models.

What are MiniMax M3's weaknesses?

Abstract reasoning is the clearest gap. ARC-AGI-2 scores are in the low single digits. The thinking mode can burn excessive tokens on simple tasks, inflating effective API cost. The minimax-community license restricts real commercial use without MiniMax's written approval. Deployment complexity from the novel MSA architecture means fewer out-of-the-box recipes compared to standard Transformer models.

Should I pick MiniMax M3 over GPT-5.5 or Claude Opus?

Pick M3 when API cost per task matters more than absolute peak accuracy, or when you need native image/video input plus coding in the same call. Stick with GPT-5.5 for the strongest terminal agent scores or Claude Opus for careful refactors and honest test feedback. M3 is a compelling middleweight: better than many open alternatives, cheaper than closed frontier models, but not the outright winner on any single dimension.