Models
Kimi K2.7 Code (2026): 1T MoE Coding Model, Benchmarks & Pricing
Kimi K2.7 Code: 1T open-source coding model from Moonshot AI, 32B active MoE, preserve_thinking mode, benchmarks vs GPT-5.5 and Claude Opus.
Kimi K2.7 Code: Moonshot AI’s 1 Trillion Parameter Open-Source Coding Model
Kimi K2.7 Code is Moonshot AI’s new open-source coding model with 1 trillion total parameters and only 32 billion activated per token. It’s built for developers who want a specialized coding assistant that holds a reasoning thread across multi-turn sessions. The code quality is real — it beats Claude Opus 4.8 on agentic reliability benchmarks — but the API pricing is premium-tier, and that’s already rubbing people the wrong way.
For the wider model landscape, see Latest AI Models Compared (2026). For the open-weights competitor that launched the same week, see our MiniMax M3 breakdown. For how these coding models compare head-to-head, see DeepSeek V4 vs ChatGPT vs Claude.
Quick Specs
| Spec | Detail |
|---|---|
| Release date | June 12, 2026 |
| Total parameters | 1 trillion |
| Active per token | 32 billion |
| Architecture | Mixture of Experts (MoE), 384 experts, 8 active + 1 shared |
| Context window | 256,000 tokens |
| Max output | 32,768 tokens (default) |
| Vision | MoonViT 400M (images + video) |
| License | Modified MIT (open weights) |
| API pricing | $0.95/M in, $0.19/M cached, $4.00/M out |
| Free access | kimi.com/code (rate limited) |
What Is Kimi K2.7 Code
Moonshot AI launched in Beijing in 2023 and ships products under the “Kimi” brand at kimi.com. The lineage goes K2.0, K2.5, K2.6, and now K2.7 Code — three major releases in roughly eighteen months, each shrinking the gap between open-source and frontier proprietary models.
K2.6 was a general-purpose model that scored respectably on coding. K2.7 Code takes that same architecture and re-focuses it on one thing: writing, debugging, and reasoning about code across sessions spanning dozens of turns.
The model uses a Mixture of Experts design with 384 total experts. On each token, it routes through 8 of them plus 1 shared expert. Only 32 billion parameters fire per token out of the full trillion on disk. That extreme sparsity keeps inference practical without a rack of GPUs.
It’s also got MoonViT, a 400-million-parameter vision encoder, so you can paste screenshots, diagrams, or video frames into your coding session. Most coding-specific models skip multimodal entirely. K2.7 doesn’t.
K2.7 forces “preserve_thinking” mode — you can’t turn it off. The model always produces a reasoning trace that carries context across turns. For one-shot questions, that’s annoying overhead. For coding sessions that last an hour and touch six files, it’s the thing that keeps the model from going off the rails on turn twelve.
Key Features
Extreme Sparsity: 1T Parameters, 32B Active
The 1-trillion headline is eye-catching, but the 32-billion-active number is what you’d experience. K2.7 uses Multi-head Latent Attention (same family DeepSeek popularized) with 61 layers, one dense and the rest MoE-based.
In practice, the model stores enormous knowledge but taps only a small slice per request. Inference is faster than you’d expect for 1T total parameters, and expert routing specializes during training — some experts handle Python patterns, others shell scripting, others SQL.
The risk: MoE models sometimes route tokens to the wrong expert and produce nonsense. K2.6 had occasional routing failures in long sessions. K2.7 seems tighter here, though we haven’t stress-tested beyond a few hours of use.
preserve_thinking: Reasoning That Persists
Most reasoning models dump their thought trace after every turn and start fresh. That’s fine for one-shot questions. It’s terrible for debugging a multi-file refactor where turn five depends on context from turn two.
K2.7’s preserve_thinking keeps the reasoning chain alive. The model revisits earlier decisions, maintains assumptions it made three turns ago, and doesn’t lose track of why it chose one approach over another.
Moonshot claims a 30% reduction in thinking-token usage compared to K2.6. The model simply doesn’t need to re-derive everything from scratch. Less overthinking, fewer wasted tokens, faster responses at the same quality level.
The catch: you’re locked into thinking mode and a temperature of 1.0 with top_p at 0.95. If you want deterministic, low-temperature output for code generation, this model won’t give it to you. Moonshot’s bet is that the reasoning chain compensates for the stochastic sampling. We’ll need more testing to know if that bet pays off.
Multimodal: MoonViT
MoonViT processes images and video at 400M parameters. It’s not huge — GPT-5.5 uses encoders several times larger — but it handles error screenshots, architecture diagrams, UI mockups, and short screen recordings.
In limited testing, vision coding was functional, not magical. Paste a React component screenshot and ask for CSS: it’s fine. Show a complex system architecture diagram for critique: it sometimes misses connections. The capability is real, just don’t expect it to replace a human reading a design doc.
Coding Specialization, Not General Purpose
K2.7 is not a general-purpose chat model. The jumps are real: +21.8% on Kimi Code Bench V2 over K2.6, +11% on Program Bench, +31.5% on MLS Bench Lite. But the specialization narrows its range. K2.6 was broadly capable across writing, analysis, and creative tasks. K2.7 Code is optimized for function signatures and test suites, not sonnets.
Benchmarks
Moonshot published six benchmarks comparing K2.7 Code against K2.6, GPT-5.5 Codex, and Claude Opus 4.8. Here’s the full picture:
Coding Benchmarks
| Benchmark | K2.6 | K2.7 Code | GPT-5.5 Codex | Claude Opus 4.8 |
|---|---|---|---|---|
| Kimi Code Bench V2 | 50.9 | 62.0 | 69.0 | 67.4 |
| Program Bench | 48.3 | 53.6 | 69.1 | 63.8 |
| MLS Bench Lite | 26.7 | 35.1 | 35.5 | 42.8 |
Agentic Benchmarks
| Benchmark | K2.6 | K2.7 Code | GPT-5.5 Codex | Claude Opus 4.8 |
|---|---|---|---|---|
| Kimi Claw 24/7 | 42.9 | 46.9 | 52.8 | 50.4 |
| MCP Atlas | 69.4 | 76.0 | 79.4 | 81.3 |
| MCP Mark Verified | 72.8 | 81.1 | 92.9 | 76.4 |
What These Numbers Mean
The geometric mean across all six: K2.7 Code at 56.3%, up from K2.6’s 48.2%. GPT-5.5 Codex leads at 62.7%, Opus 4.8 at 62.2%. That’s a 16.8% relative improvement over K2.6 — a real generational leap.
On raw coding, GPT-5.5 is still the king. It leads K2.7 by 7 points on Kimi Code Bench V2 and 15.5 points on Program Bench. If your workflow is write-prompt-get-code-done, GPT-5.5 is faster and more accurate.
But the agentic benchmarks flip the story. K2.7 beats Opus 4.8 on MCP Mark Verified (81.1 vs 76.4), which measures reliable multi-step task execution. It nearly ties GPT-5.5 on MLS Bench Lite (35.1 vs 35.5). The preserve_thinking mechanism pays off in long-running, multi-turn scenarios.
The MCP Atlas gap remains: K2.7 at 76.0 vs Opus at 81.3 vs GPT-5.5 at 79.4. Frontier models still lead on tool use and multi-step orchestration.

What We Actually Tested
We spent about four hours with K2.7 Code on June 13 — the day after release — on: refactoring a Python data pipeline, writing React components from specs, debugging a Go race condition, and generating a moderately complex SQL reporting query. It handled all four competently. The Go race condition fix (mutex-based) worked on first try. React components were correct but verbose.
What we didn’t test: multi-file refactors, IDE plugins, video input, large-codebase throughput, or 256K-context scenarios.
Running K2.7 Code locally
K2.7 Code is a 1-trillion-parameter MoE model. That number sounds impossible to run locally. But only 32 billion parameters fire per token. The extreme sparsity changes the local hosting math entirely, and the open-source community has already built tooling around it.
The sparsity advantage
With only 32B active parameters per forward pass, K2.7 Code is closer in inference cost to a 32B dense model than a 1T dense model. The full 1T weights sit on disk (or in GPU memory if you have it), but each token only touches a small slice. This makes CPU offloading viable in a way it is not for dense 400B+ models where every parameter fires on every token.
In practice, you can run K2.7 Code on a single high-end workstation with 2-4 GPUs and CPU offloading for the inactive experts. It will not be fast, but it works. Four RTX 4090s (24GB each) with KTransformers and INT4 quantization can serve K2.7 at roughly 6-10 tokens per second at 32K context.
Quantization and hardware guide
Moonshot ships INT4-quantized weights natively on Hugging Face. This is not a community afterthought. K2.7 was trained with quantization in mind, and the INT4 weights come directly from Moonshot’s training pipeline.
| Setup | Hardware | Tok/sec at 32K | VRAM needed | Quality hit |
|---|---|---|---|---|
| BF16 (full) | 8x H100-80GB or 4x B200 | 25-40 tok/s | ~2TB | Reference |
| INT8 | 4x H100 or 8x A100-40GB | 15-25 tok/s | ~1TB | Near-lossless |
| INT4 (vLLM) | 4x A100-80GB or 8x A6000 | 12-20 tok/s | ~500GB | Minimal on coding |
| INT4 (KTransformers) | 4x RTX 4090 24GB | 6-10 tok/s | ~96GB + CPU RAM | Acceptable for debugging |
| INT4 (CPU offload) | 2x RTX 4090 + 128GB RAM | 2-4 tok/s | ~48GB VRAM + 128GB RAM | Slow but functional |
The KTransformers project (github.com/kvcache-ai/ktransformers) has first-class support for K2.7 Code. It is the recommended path for local deployment if you do not have a data center GPU cluster. Their INT4 kernel is optimized specifically for K2.7’s MoE routing pattern.
vLLM setup
pip install vllm>=0.9.0
python -m vllm.entrypoints.openai.api_server \
--model moonshotai/Kimi-K2.7-Code \
--dtype auto \
--max-model-len 65536 \
--gpu-memory-utilization 0.90 \
--tensor-parallel-size 4 \
--quantization fp8K2.7 Code uses standard MLA attention (same family as DeepSeek V3), so it does not need trust_remote_code. The INT4 checkpoint loads directly without custom kernel compilation. That alone makes local deployment much smoother than competing models with novel attention mechanisms.
For a comparison with another model that has mature local deployment tooling, see our Gemma 4 12B local setup guide. The patterns are similar even though the scale is different.
KTransforms for budget hardware
If you have a desktop with 2-4 consumer GPUs, KTransformers is the practical choice:
git clone https://github.com/kvcache-ai/ktransformers
cd ktransformers
pip install -e .
python -m ktransformers.server \
--model moonshotai/Kimi-K2.7-Code \
--quantization int4 \
--cpu-offload-gb 64 \
--max-context 32768This offloads inactive MoE experts to system RAM and only keeps active experts on GPU. At INT4, the 32B active parameters need about 16GB of GPU memory. The remaining 968B-parameter dormant experts sit in CPU RAM, paged in as needed when expert routing changes.
Expect 2-4 tok/s with this setup. It is slow for interactive chat but viable for overnight batch jobs. The real question is whether the API at $4/M output tokens is actually more expensive than the electricity and hardware depreciation of running a multi-GPU rig around the clock.
Local vs API: the real cost comparison
For one-off coding sessions: use the API. The free kimi.com web chat handles evaluation. Paid API at $5-$10 per complex session is cheaper than even one day of GPU cloud rental.
For CI pipelines doing hundreds of calls per day: local hosting wins if you already own the hardware. A 4x A100 server amortized over a year costs about $15-20 per hour. If you can batch requests and keep the GPUs saturated, local INT4 inference beats the $4/M output token API rate on volume.
For sensitive codebases: local. No question. The Modified MIT license means you can run K2.7 wherever you want. No data leaves your network. That alone justifies the deployment effort for any team working on proprietary code.
Pricing
K2.7 Code’s API costs $0.95/M input, $0.19/M cached input, and $4.00/M output. Free web chat at kimi.com/code lets you test it, with rate limits.
This is premium-tier pricing. DeepSeek V4 Pro is roughly 10-15x cheaper on input, 8-12x cheaper on output. MiMo V2.5 Pro is significantly cheaper on cached input.
The sore spot is cached inputs. At $0.19/M, K2.7’s cache reads are about 53 times more expensive than DeepSeek’s. For developers relying on long system prompts or context caching — most production users — that stings.
A user on Hacker News (mdasen) put it bluntly: “The real savings with MiMo/DeepSeek’s price cut is the cached inputs. K2.7 at 53x more expensive on cache reads eats the value proposition for production workloads.”
That said, one developer (pizlonator) rebased a 177KB OpenSSL patch from 3.3.1 to 3.5.7 using K2.7 for an estimated $5-$10. For a task of that complexity, that’s cheap.
The story is mixed. For one-off complex sessions, K2.7 is affordable. For CI workflows or apps calling the model hundreds of times daily, the cached input pricing is hard to justify.
Community Reaction
The Hacker News thread hit 427 points and 225 comments within hours. The mood was mostly positive, with sharp elbows on pricing.
Bnjoroge observed: “Feels like kimi are positioning themselves as the premium open source models.” Accurate. Moonshot isn’t racing DeepSeek to the bottom — they’re betting preserve_thinking and coding specialization justify a higher price tag.
The most grounded praise came from real workloads. goldenarm verified the geometric mean against published numbers. pizlonator rebased a 177KB OpenSSL patch from 3.3.1 to 3.5.7 for under $10. Others reported successful multi-file refactors via Claude Code and Cursor.
Some grumbled about the locked temperature and thinking-token overhead. A few developers wanted deterministic output. Most conceded the reasoning chain is central to K2.7’s identity — disable it and you’ve just got K2.6.
yanis_t voiced a frustration that applies broadly: “its ability to not fuck up my projects is absent.” Fair. No model at this tier is reliable enough for unattended production use. The question is which one breaks things in the most recoverable way.
vs Alternatives
Kimi K2.7 Code vs DeepSeek V4 Pro
DeepSeek V4 Pro is substantially cheaper across the board, especially on cached inputs. It scores higher on MCP Atlas and has a larger ecosystem of deployment tools and community adaptations.
K2.7 Code’s advantages: multimodal support (DeepSeek is text-only), preserve_thinking for multi-turn coherence, and slightly higher scores on MCP Mark Verified. If you need vision input or long coding sessions, K2.7 has a real edge. If you are price-sensitive and text-only, DeepSeek is the pragmatic choice. For the full comparison across coding models, see our DeepSeek V4 vs ChatGPT vs Claude breakdown.
Kimi K2.7 Code vs Claude Opus 4.8
Opus 4.8 is the overall stronger coder on raw benchmarks: Kimi Code Bench V2 (67.4 vs 62.0), Program Bench (63.8 vs 53.6), MLS Bench Lite (42.8 vs 35.1). It produces cleaner, more idiomatic code on the first try.
K2.7 wins on MCP Mark Verified (81.1 vs 76.4), the metric for reliable task execution. And it’s open-source with a permissive license — Opus is proprietary and API-only. If you need weights you can self-host or fine-tune, K2.7 is the only game between these two.
Kimi K2.7 Code vs GPT-5.5 Codex
GPT-5.5 is the coding benchmark leader and probably the best coding model as of June 2026. It beats K2.7 on every raw coding metric and nearly every agentic metric. The only near-tie is MLS Bench Lite (35.5 vs 35.1).
K2.7’s counter is openness. GPT-5.5 is proprietary, expensive, and controlled entirely by OpenAI. K2.7’s weights are on Hugging Face under Modified MIT. For organizations that can’t or won’t depend on a single vendor’s API, that matters.

Who Should Use Kimi K2.7 Code
Use it if: you’re a developer who works in long coding sessions, jumps between files, and needs the model to remember context from earlier turns. The preserve_thinking mechanism was built for exactly your workflow. You should also use it if you need an open-source coding model with multimodal input — K2.7 is the strongest open option in this category right now.
Watch it if: you’re price-sensitive or building production pipelines that make hundreds of API calls per day. DeepSeek V4 Pro and MiMo V2.5 Pro offer similar quality at a fraction of the cost, especially on cached inputs. The “6x High-Speed Mode” Moonshot teased in the launch might change the value equation, but it’s not available yet.
Skip it if: you need the absolute best coding accuracy on one-shot prompts and don’t care about open-source. GPT-5.5 Codex and Claude Opus 4.8 are stronger on raw benchmarks and produce cleaner code with less iteration. You’re paying a premium for those models, but the time saved on revisions adds up fast.
Skip it if: you want a general-purpose model. K2.7 Code is narrowly optimized for programming. K2.6 or the base K2.7 (if Moonshot releases a general version) would serve broader use cases better.
Verdict
Kimi K2.7 Code earns a seat at the table with GPT-5.5 Codex and Claude Opus 4.8. It doesn’t beat them on raw coding benchmarks, and it won’t be your first choice for quick one-shot generation.
But preserve_thinking solves a real problem. Multi-turn coding sessions fall apart when the model forgets constraints or loses the plot on turn eight. K2.7 keeps its thread better than anything else in open-source, and that matters more for actual development work than a five-point gap on a synthetic benchmark.
The pricing is the friction point. You can do serious work for under ten dollars, but cached input is 53 times what DeepSeek charges. Moonshot is pricing K2.7 above the open-source commodity tier and below the proprietary frontier. Whether that holds depends on how fast DeepSeek and MiMo close the quality gap.
The “6x High-Speed Mode” teaser is worth watching. If Moonshot delivers faster, cheaper inference without sacrificing the reasoning chain, K2.7 becomes much more attractive for production use. Until then, it’s a great model for focused coding sessions a few times per day, and expensive for anything continuous.
For the open-weights alternative that launched alongside K2.7 with comparable coding benchmarks at a lower price, see our MiniMax M3 breakdown.

Changelog
- 2026-06-13: First publish. Kimi K2.7 Code specs, benchmarks, pricing, and community reaction as of launch day.
Frequently asked
8 questionsWhat is Kimi K2.7 Code?
Kimi K2.7 Code is an open-source coding AI model from Moonshot AI, released June 12, 2026. It has 1 trillion total parameters but activates only 32 billion per token using Mixture of Experts architecture. It supports 256K context, image and video input via MoonViT, and forces a preserve_thinking mode that retains reasoning across multi-turn coding sessions. Weights are available under a Modified MIT license on Hugging Face.
How good is Kimi K2.7 Code at programming?
It scores 62.0 on Kimi Code Bench V2 and 53.6 on Program Bench -- solid, but it trails GPT-5.5 Codex (69.0, 69.1) and Claude Opus 4.8 (67.4, 63.8) on raw coding. However, it beats Opus 4.8 on MCP Mark Verified (81.1 vs 76.4) and nearly matches GPT-5.5 on MLS Bench Lite (35.1 vs 35.5), so it's genuinely competitive for agentic coding tasks.
How much does Kimi K2.7 Code cost?
API pricing is $0.95 per million input tokens, $0.19/M for cached input, and $4.00/M for output. This is more expensive than DeepSeek V4 Pro and MiMo V2.5 Pro, especially on cached inputs. The free tier at kimi.com/code lets you test it without spending anything, but with rate limits.
Does Kimi K2.7 Code support images and video?
Yes. It includes MoonViT, a 400-million-parameter vision encoder that handles image and video input alongside text. This makes it multimodal, unlike most coding-specialist models in this weight class.
Is Kimi K2.7 Code open source?
Yes, weights are released under a Modified MIT license on Hugging Face at huggingface.co/moonshotai/Kimi-K2.7-Code. It works with vLLM, SGLang, KTransformers, and integrates with Cursor, VS Code, Claude Code, Roo Code, and Cline.
How does K2.7 Code compare to DeepSeek V4 Pro?
DeepSeek V4 Pro is significantly cheaper, especially on cached inputs (53x cheaper than K2.7 on that metric). DeepSeek also scores higher on most agentic benchmarks. K2.7's main advantages are multimodal support and the preserve_thinking mechanism, which DeepSeek lacks.
What hardware do I need to run Kimi K2.7 Code locally?
With 1 trillion total parameters, you need serious hardware even though only 32B are active per token. The full model requires multiple GPUs. Most users will access it via the API or kimi.com chat. Projects like KTransformers and vLLM support quantized inference, but self-hosting is not cheap.
Can I use Kimi K2.7 Code inside my IDE?
Yes. It integrates with Cursor, VS Code, Claude Code, Roo Code, and Cline. The forced preserve_thinking mode means it maintains coherent reasoning across multi-turn coding sessions inside your IDE, which is uncommon among coding models.
More in Models
View all
GLM-5.2: Open-Source Frontier Model with 1M Context, Benchmarks, and Local Setup (2026)
GLM-5.2 from Zhipu AI is a 744B open-weight model under MIT license. Benchmarks, pricing, local setup with vLLM and llama.cpp, and how it compares to Claude Opus 4.8 and GPT-5.5.
Models

MiniMax M3 Open Source (2026): 428B Model, 1M Context & Benchmarks
MiniMax M3: 428B open-weights model, 1M context via sparse attention, native multimodal input, competitive coding benchmarks, and 10x cheaper than GPT-5.5.
Models

US Government Blocks Anthropic Fable 5 & Mythos 5 (2026)
US government ban on Anthropic: Commerce Dept ordered suspension of Fable 5 & Mythos 5 on June 12, 2026. Full timeline of the 4-month feud.
Models
More stories
View all
Siri AI Review (2026): Apple's Rebuilt Assistant vs ChatGPT & Gemini [Tested]
Siri AI is Apple's rebuilt assistant for 2026. See features, privacy model, device support, and how it compares to ChatGPT and Gemini.
Review

Claude Fable 5 Release (2026): Anthropic's Most Powerful AI Model Explained
Claude Fable 5 is the first Mythos-class model available to the public. State-of-the-art coding, vision, and knowledge work with new safeguards. Pricing, benchmarks, and what it means.
Models

Ideogram AI Review (2026): Free Tier Tested, vs Midjourney & Recraft
Ideogram AI review (2026): we tested free tier, pricing, text rendering, and Ideogram 4.0 vs Midjourney and Recraft. Who should use it?
Review

Genspark Speakly Review (2026): Pricing, Accuracy & Is It Worth It?
Honest genspark speakly review after hands-on testing. See Speakly pricing, accuracy, free tier limits, and how it compares to Otter and Whisper.
Review