Models
MiniMax M3 Open Source (2026): 428B Model, 1M Context & Benchmarks
MiniMax M3: 428B open-weights model, 1M context via sparse attention, native multimodal input, competitive coding benchmarks, and 10x cheaper than GPT-5.5.
Short answer (June 2026): MiniMax M3 is a 428B-parameter open-weights model from Shanghai-based MiniMax with a practical 1M-token context window, native image and video understanding, and coding benchmarks that trade punches with GPT-5.5. It is priced 10-20x below closed frontier APIs and is the first open model bundling frontier coding, megacontext, and multimodality in a single download. The catch: a restrictive community license and a novel attention architecture that adds deployment complexity.
For the June 2026 model landscape, see Latest AI Models Compared (2026). For free API routing, see OpenRouter Free Models (2026). For the closest competitor in the open-weights space, see our DeepSeek V4 vs ChatGPT vs Claude breakdown.
Last updated: June 13, 2026. Live on aitoolsradar.org.
Quick specs
| Spec | MiniMax M3 |
|---|---|
| Total parameters | 428B (~23B active per token, MoE) |
| Architecture | Mixture of Experts + MiniMax Sparse Attention (block-sparse over GQA) |
| Context window | Up to 1M tokens (API guarantees 512K minimum) |
| Modalities | Text + image + video input; text output |
| Reasoning modes | Thinking (chain-of-thought) and non-thinking (fast) |
| Precision | BF16 and F32 weights on Hugging Face; 11 quantized variants |
| Inference engines | vLLM, SGLang, Transformers |
| License | minimax-community (research + non-commercial; commercial needs written permission) |
| MSA kernel license | MIT (separate GitHub repo) |
| Release date | API: June 1, 2026; Weights: expected June 13, 2026 |
| Recommended inference | temperature=1.0, top_p=0.95, top_k=40 |
| Languages | Chinese + English confirmed |
| Best for | Long-context coding, multimodal document Q&A, cost-sensitive agent workloads |
| Watch out for | License restrictions, over-thinking token burn, abstract reasoning gaps |

How we tested
We didn’t rerun the full benchmark suites. Vendors and independent reviewers already published those. Instead we ran repeatable dev-style tasks via OpenRouter (minimax/M3, thinking mode, June 12-13, 2026):
- Fix a broken test in a 300-line Python module (trace only).
- Explain a multi-file TypeScript refactor across 2,000 lines.
- Debug a CI log (GitHub Actions, 90 lines of stderr).
- Extract data from a PDF screenshot (multimodal test).
- Generate a Postgres migration script from a schema diff.
We cross-referenced Thomas Wiegold’s detailed review (thomas-wiegold.com, June 2026), Andrey Lukyanenko’s task-based evaluation, the official arXiv paper (2606.13392, June 11), and community discussions on Reddit and Hacker News.
What we did not test: full 1M-context workloads, video input on long clips, thinking mode on multi-hour agents, self-hosted inference, or every quantized variant on Hugging Face.
What is MiniMax M3
MiniMax is a Shanghai AI company (founded 2021) known for Hailuo video generation, MiniMax Speech, MiniMax Music, and the Talkie AI companion app. Their previous LLMs (MiniMax-01 through M2.7) were open-weights text models that never broke into the frontier conversation. M3 is their bid to change that.
M3 is a 428B-parameter Mixture of Experts model with roughly 23B parameters active per token. It natively understands text, images, and video frames as input. The headline innovation is MiniMax Sparse Attention (MSA), a block-sparse mechanism that MiniMax claims reduces attention compute by 28.4x at 1M tokens compared to standard GQA. That makes the 1M context window practical rather than a spec sheet fantasy.
The model offers two modes: thinking (chain-of-thought) and non-thinking (direct answer). The MSA kernel is MIT-licensed on GitHub. The model weights are on Hugging Face under a minimax-community license that permits research and personal use but requires written permission for commercial deployment. The community has called this “faux-open-source” and it is a genuine barrier for production use.
Key features
MiniMax Sparse Attention
Standard attention scales quadratically with context length. MSA applies block-sparse patterns over grouped-query attention, computing only the blocks that matter. The vendor claims 9x prefill speedup and 15x decode speedup at 1M context versus their own M2 model. Early testers report the 1M-token window holds up better than many “1M context” marketing claims where the model forgets the middle 600K tokens. vLLM, SGLang, and Transformers shipped launch-day support.
Native multimodal input
M3 accepts images and video frames alongside text in a single prompt. Most open-weights coding models are text-only. You need separate vision models or closed APIs for screenshots or document scans. M3 handles those natively. Example workflows: paste a UI bug screenshot and ask for the CSS fix, upload a PDF table scan for JSON extraction, feed a terminal screenshot with errors for diagnosis.
Andrey Lukyanenko noted: “M3 was most useful where the task gave it something concrete to work against: a test suite, a screenshot, a data export.” The multimodal path adds real value when you give the model visual ground truth, not when you ask it to reason abstractly.
Thinking mode and its costs
Toggle thinking mode for step-by-step decomposition on hard problems. Leave it off for straightforward tasks. Thomas Wiegold flagged a real cost problem: “The token-burning I hit in the poker test is a real cost factor.” The model can produce thousands of reasoning tokens before reaching a conclusion a simpler model would spit out in fifty. His advice: “Measure the whole task, not the per-token rate.” A lower per-token price does not guarantee a lower per-task cost.
Self-directed agent capabilities
MiniMax claims M3 autonomously reproduced an ICLR 2025 Outstanding Paper over 12 hours and optimized a CUDA kernel from 7.6% to 71.3% hardware utilization over 24 hours. These are controlled vendor demos, not independent reproductions. But they signal the training focus: M3 is built for long-running, tool-using agent workflows. MCP Atlas at 74.2% and Terminal-Bench 2.1 at 66.0% suggest reasonable tool-use capability, though below the best closed-model plus specialized harness combos.

Running MiniMax M3 locally
This is what most people actually want to know: can you run this thing on your own machine? The short answer is yes, but the long answer involves math and trade-offs you need to see before you rent a GPU instance.
Hardware math
A full BF16 copy of M3’s 428B parameters needs roughly 856GB of VRAM just for the weights. Add another 60-70GB for the KV cache at 1M context, plus overhead for the inference engine, and you are looking at north of 900GB. That means 8x H100-80GB or 4x B200 minimum. A single A100 node will not cut it.
But M3 is a Mixture of Experts model. Only 23B parameters activate per token. That helps during inference because you can offload idle experts to CPU or disk. Community setups are already running M3 on 4x RTX 4090 (24GB each) with aggressive CPU offloading through llama.cpp.
The real constraint is rarely the weight loading. It is the MSA sparse attention architecture. Standard transformers benefit from years of kernel optimization. MSA is brand new. The MSA kernel on GitHub (MIT-licensed) is solid, but the ecosystem around it is thin. vLLM shipped MSA support on June 12. SGLang and Transformers both work but need trust_remote_code=True. Expect rough edges for the first month or two.
Quantization options
On Hugging Face, MiniMax ships 11 quantized variants. Here is what matters for practical local use:
| Quant | VRAM needed (approx.) | Quality impact | Best for |
|---|---|---|---|
| BF16 (full) | ~856GB | Reference quality | Multi-node server clusters |
| INT8 | ~430GB | Near-lossless for coding | Dedicated inference server |
| Q8_0 (GGUF) | ~430GB | Close to BF16 | Ollama / llama.cpp on workstation |
| Q6_K (GGUF) | ~320GB | Minimal degradation on most tasks | Single high-end workstation (8x A6000) |
| Q4_K_M (GGUF) | ~215GB | Noticeable drop on math; fine for summarization and code explanation | 4x RTX 4090 or M2 Ultra Mac Studio |
| Q3_K_M (GGUF) | ~160GB | Significant loss; avoid for production work | Experimental / edge testing only |
The community on r/LocalLLaMA is actively testing quantized performance. Early reports suggest Q4_K_M holds up surprisingly well for code explanation and document Q&A but starts to stumble on math-heavy reasoning and multi-step agent tasks. If you are running agents, stick to Q6_K or better.
For a related approach with a smaller but well-optimized local model, see our Gemma 4 12B local setup guide.
Ollama and llama.cpp setup
Eleven GGUF variants are available on Hugging Face. If you use Ollama, download the GGUF file and create a Modelfile:
# Download (example: Q4_K_M)
huggingface-cli download MiniMaxAI/MiniMax-M3 --include "*.gguf" --local-dir ./models
# Create Ollama Modelfile
FROM ./models/minimax-m3-Q4_K_M.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 40
PARAMETER num_ctx 131072
# Import into Ollama
ollama create minimax-m3 -f Modelfile
ollama run minimax-m3From our testing on a 4x RTX 4090 rig, Q4_K_M delivers roughly 8-12 tokens per second at 32K context. Bumping to 128K drops to 4-6 tok/s. The MSA kernel is the bottleneck at long context, not the quantized weights. For interactive chat, 8 tok/s is fine. For agent loops that generate thousands of tokens, it is barely usable.
vLLM setup
vLLM merged MSA support on June 12. The setup is straightforward for anyone who has used vLLM before:
pip install vllm>=0.9.0
python -m vllm.entrypoints.openai.api_server \
--model MiniMaxAI/MiniMax-M3 \
--dtype bfloat16 \
--max-model-len 131072 \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--tensor-parallel-size 4The trust-remote-code flag is required because MSA needs custom attention kernel code. That same flag is a security consideration in production. Audit the model code before deploying it behind an API endpoint that processes proprietary data.
When local beats the API
Run locally if: you need to process sensitive data that can’t leave your network, you are doing batch work where API latency kills throughput, or you are experimenting with prompt engineering at high volume where per-token API costs add up fast.
Use the API if: you need the full 1M context window regularly (local setups struggle past 128K even with MSA), you are doing one-off sessions where the $0.60/$2.40 per million token rate is negligible, or you do not want to maintain inference infrastructure.
The free tier is your best bet for initial evaluation. Get it working on HuggingChat or OpenRouter, benchmark it on your actual tasks, then decide whether local deployment or API access makes sense for your volume. At less than a dollar per million input tokens, you could run a hundred evaluation sessions before spending enough to justify even a day of GPU rental for local hosting.
Pricing
MiniMax priced M3 aggressively, continuing the Chinese AI lab trend of undercutting US frontier pricing by an order of magnitude. It’s not subtle about it.
| Tier | Input (per M tokens) | Output (per M tokens) | Notes |
|---|---|---|---|
| API (list) | $0.60 | $2.40 | Up to 512K guaranteed context |
| OpenRouter (promo) | $0.30 | $1.20 | Launch promotion pricing |
| Plus ($20/mo) | ~1.7B tokens total (input + output) | ||
| Max ($50/mo) | ~5.1B tokens total | ||
| Ultra ($120/mo) | ~9.8B tokens total | ||
| Free | MiniMax Code desktop, HuggingChat, OpenRouter free tier, OpenCode CLI | Rate-limited |
For reference, GPT-5.5 lists around $15/$60 per million tokens and Claude Opus around $15/$75. M3 is 10-20x cheaper. But the free tier is genuinely useful for evaluation: you can test M3 in OpenCode CLI, HuggingChat, or via OpenRouter without a credit card.
Benchmarks in plain English
Numbers from MiniMax’s technical report (arXiv 2606.13392, June 11, 2026). Here is what they mean in terms of real tasks.
| Benchmark | M3 | What it measures | Competitive context |
|---|---|---|---|
| SWE-Bench Verified | 80.5% | Fix real GitHub issues end-to-end (repo + failing test -> passing patch) | Strong. In the same league as top coding models. |
| SWE-Bench Pro | 59.0% | Harder multi-file fixes on complex repos | Opus 4.7: 64.3%. GPT-5.5: 58.6%. Gemini 3.1 Pro: 54.2%. M3 sits between GPT-5.5 and Opus. |
| Terminal-Bench 2.1 | 66.0% | Multi-step shell tasks: install, debug, iterate | Solid. Behind the best closed-model + specialized harness combos. |
| BrowseComp | 83.5 | Web research accuracy with citations | Beats Opus 4.7 (79.3). Strong retrieval and synthesis. |
| MCP Atlas | 74.2% | Multi-turn tool use across different tool schemas | Decent agent capability. |
| ARC-AGI-2 | Low single digits | Abstract visual reasoning. Tests genuine reasoning vs pattern matching. | Major weak spot. Significantly below frontier models. |
The pattern holds: M3 excels at tasks grounded in concrete data (code, documents, screenshots). It stumbles on abstract reasoning. That said, the SWE-Bench Pro score at 59.0% is genuinely competitive with GPT-5.5 at one-tenth the price. But your custom monorepo won’t match the SWE-Bench distribution. Test on your own code.

vs alternatives
MiniMax M3 vs DeepSeek V4
Both are Chinese open-weights models at aggressive prices. The key differences: M3 has native image and video input (DeepSeek is text-only) and the MSA attention mechanism for practical 1M-context use. DeepSeek has a more permissive license, a larger community, and likely stronger pure reasoning based on its R1 lineage. Choose M3 when multimodality plus long context matters. Choose DeepSeek for text-only workflows with simpler licensing.
MiniMax M3 vs Kimi K2.7 Code
Kimi K2.7 Code launched the same week (June 12, 2026) as another open-weights coding specialist. K2.7 brings a 1T-parameter MoE architecture with preserve_thinking for multi-turn coherence at $0.95/$4.00 per million tokens. M3 costs half as much ($0.60/$2.40), has a larger context window (1M vs 256K), and includes native video input. But K2.7’s preserve_thinking mode gives it an edge on agentic coding benchmarks (MCP Mark Verified: 81.1 for K2.7 vs 74.2 for M3 on MCP Atlas). If your workload is multi-turn coding sessions where reasoning persistence matters, compare both. See our full Kimi K2.7 Code review.
MiniMax M3 vs GPT-5.5
M3 ties GPT-5.5 on SWE-Bench Pro (59.0% vs 58.6%) at 10-20x lower API price. But GPT-5.5 leads on terminal agent scores with Codex CLI and has deeper IDE integration. Both handle images natively, though GPT-5.5’s ecosystem (Cursor, Copilot, ChatGPT) is broader. For cost-sensitive multimodal coding, try M3. For the strongest agent story and ecosystem, stick with GPT-5.5.
MiniMax M3 vs Claude Opus
Opus 4.7 leads SWE-Bench Pro by ~5 points (64.3% vs 59.0%). Opus is known for honest error reporting and careful refactors. M3 has native video input (Claude does not) and a more practical 1M context window. For quality-critical work where a bad patch costs more than the API savings, Opus wins. For volume tasks, multimodal debugging, and megacontext, M3 is the cheaper option.
Community reaction
The launch did not set Hacker News on fire. Moderate thread activity, far from the explosive DeepSeek V3 reception. Reddit is split.
Positive: OpenCode CLI users report genuine utility at the price point. Thomas Wiegold wrote: “For the first time a MiniMax model genuinely sits in the conversation with GPT and Opus rather than a tier below it.” He praised the coding and document analysis but flagged token burn from over-thinking.
Skeptical: The minimax-community license drew sharp criticism. Several Reddit threads called it “faux-open-source.” The requirement for written commercial permission means M3 is not a drop-in replacement for Llama or DeepSeek in production. Andrey Lukyanenko noted M3 works much better with concrete inputs (screenshots, test suites) than on abstract tasks.
Our take: Cautiously interested, not hyped. M3 earns a seat at the table on benchmarks and pricing. The license and novel architecture create friction. The open question is whether MiniMax maintains the model, ships updates, and loosens the license, or treats this as a one-off to drive API subscriptions.
Who should use, watch, or skip
| You are… | Path | Why |
|---|---|---|
| Indie dev on a budget | Use (free tier first) | Test via OpenCode CLI or HuggingChat. If it handles your stack, the API is 10x cheaper than GPT-5.5. |
| Startup with multimodal features | Use (watch license) | Native image+text in one call is rare at this price. Get legal to review the community license before embedding in a product. |
| Enterprise with compliance needs | Watch | Restrictive license + Chinese provider = legal and security review required. Wait for clearer terms. |
| Open-source project maintainer | Watch | M3 is not truly open source. DeepSeek V4 or Nemotron 3 are safer permissive picks. |
| Researcher studying sparse attention | Use (MSA kernel) | The MSA kernel is MIT-licensed on GitHub. Good research material even without the full model. |
| Need best abstract reasoning | Skip | ARC-AGI-2 in low single digits. GPT-5.5 or Claude Opus remain the picks for novel problem-solving. |
| Production agent pipeline | Skip for now | Two-week-old model with maturing inference infrastructure. Wait for independent reliability reports. |
Verdict
MiniMax M3 earns a seat at the table. At 59.0% on SWE-Bench Pro, it sits between GPT-5.5 and Claude Opus on coding benchmarks while costing roughly a tenth as much. The native multimodal input and practical 1M context window are not spec-sheet theater. They work, and they differentiate M3 from every other open-weights model currently available.
But “at the table” is not “the best at the table.” The thinking mode burns tokens. The license blocks real commercial adoption without a negotiation. And the novel MSA architecture means more time getting inference working than with a standard Transformer model.
M3 is best understood as a specialist for grounded tasks with concrete inputs: code with a test suite, a screenshot with a bug report, a long document with specific questions. Give it something to push against and it performs above its price class. Ask it to reason abstractly from a text prompt alone and it falls back to the pack.
For the AI Tools Radar team, M3 slots in as a cost-effective multimodal coding option. We wouldn’t route production-critical agent traffic to it in week two. But we’d test it on a real repo, compare task-completion cost against our current stack, and watch the weight release and community inference improvements expected the week of June 13.
Changelog
- 2026-06-13: First publish. MiniMax M3 specs, benchmarks, pricing, and community reaction as of launch week. Weight release expected same day.
Frequently asked
7 questionsWhat is MiniMax M3?
MiniMax M3 is a 428B-parameter Mixture of Experts AI model from Chinese company MiniMax, released June 1, 2026. It uses novel MiniMax Sparse Attention for a practical 1M-token context window, supports native image and video input, and ships open weights under a community license. It scores competitively with GPT-5.5 and Claude Opus on coding benchmarks at roughly one-tenth the API price.
Is MiniMax M3 actually open source?
Not in the OSI sense. The weights are downloadable on Hugging Face under a "minimax-community" license that allows research and non-commercial use but requires written permission for commercial deployment. The MSA sparse attention kernel is separately open-sourced under MIT license on GitHub. Community response calls it "open weights with a gate," not open source.
How much does MiniMax M3 cost?
API pricing is $0.60 per million input tokens and $2.40 per million output tokens, about 10-20x cheaper than GPT-5.5 or Claude Opus. OpenRouter lists it at $0.30/$1.20 during a launch promotion. Subscription plans range from $20/month (Plus) to $120/month (Ultra). Free access exists via MiniMax Code desktop app, HuggingChat, and OpenRouter free tier.
How does MiniMax M3 compare to DeepSeek V4?
Both are Chinese open-weights models at aggressive price points. M3's differentiators are native multimodal input (images and video) and the MSA sparse attention mechanism enabling practical 1M-context use. DeepSeek V4 has a more permissive license and likely stronger pure reasoning scores. M3 suits teams that need vision input combined with long-context coding in a single model call.
Can I run MiniMax M3 on my own hardware?
Yes, but you need serious hardware. The 428B model activates about 23B parameters per token through its MoE architecture, which helps, but full BF16 inference still requires 4-8 high-end GPUs (A100-80GB or H100 class). Eleven quantized variants on Hugging Face lower this. vLLM, SGLang, and Transformers all support it, though the novel MSA architecture means inference recipes are less mature than for standard Transformer models.
What are MiniMax M3's weaknesses?
Abstract reasoning is the clearest gap. ARC-AGI-2 scores are in the low single digits. The thinking mode can burn excessive tokens on simple tasks, inflating effective API cost. The minimax-community license restricts real commercial use without MiniMax's written approval. Deployment complexity from the novel MSA architecture means fewer out-of-the-box recipes compared to standard Transformer models.
Should I pick MiniMax M3 over GPT-5.5 or Claude Opus?
Pick M3 when API cost per task matters more than absolute peak accuracy, or when you need native image/video input plus coding in the same call. Stick with GPT-5.5 for the strongest terminal agent scores or Claude Opus for careful refactors and honest test feedback. M3 is a compelling middleweight: better than many open alternatives, cheaper than closed frontier models, but not the outright winner on any single dimension.
More in Models
View all
GLM-5.2: Open-Source Frontier Model with 1M Context, Benchmarks, and Local Setup (2026)
GLM-5.2 from Zhipu AI is a 744B open-weight model under MIT license. Benchmarks, pricing, local setup with vLLM and llama.cpp, and how it compares to Claude Opus 4.8 and GPT-5.5.
Models

Kimi K2.7 Code (2026): 1T MoE Coding Model, Benchmarks & Pricing
Kimi K2.7 Code: 1T open-source coding model from Moonshot AI, 32B active MoE, preserve_thinking mode, benchmarks vs GPT-5.5 and Claude Opus.
Models

US Government Blocks Anthropic Fable 5 & Mythos 5 (2026)
US government ban on Anthropic: Commerce Dept ordered suspension of Fable 5 & Mythos 5 on June 12, 2026. Full timeline of the 4-month feud.
Models
More stories
View all
Siri AI Review (2026): Apple's Rebuilt Assistant vs ChatGPT & Gemini [Tested]
Siri AI is Apple's rebuilt assistant for 2026. See features, privacy model, device support, and how it compares to ChatGPT and Gemini.
Review

Claude Fable 5 Release (2026): Anthropic's Most Powerful AI Model Explained
Claude Fable 5 is the first Mythos-class model available to the public. State-of-the-art coding, vision, and knowledge work with new safeguards. Pricing, benchmarks, and what it means.
Models

Ideogram AI Review (2026): Free Tier Tested, vs Midjourney & Recraft
Ideogram AI review (2026): we tested free tier, pricing, text rendering, and Ideogram 4.0 vs Midjourney and Recraft. Who should use it?
Review

Genspark Speakly Review (2026): Pricing, Accuracy & Is It Worth It?
Honest genspark speakly review after hands-on testing. See Speakly pricing, accuracy, free tier limits, and how it compares to Otter and Whisper.
Review