AI Tools Radar
中文
Kimi K2.7 Code featured image with connected AI model nodes on AI Tools Radar

Models

Kimi K2.7 Code (2026): 1T MoE Coding Model, Benchmarks & Pricing

Kimi K2.7 Code: 1T open-source coding model from Moonshot AI, 32B active MoE, preserve_thinking mode, benchmarks vs GPT-5.5 and Claude Opus.

AI Tools Radar Editorial 14 min read

Kimi K2.7 Code: Moonshot AI’s 1 Trillion Parameter Open-Source Coding Model

Kimi K2.7 Code is Moonshot AI’s new open-source coding model with 1 trillion total parameters and only 32 billion activated per token. It’s built for developers who want a specialized coding assistant that holds a reasoning thread across multi-turn sessions. The code quality is real — it beats Claude Opus 4.8 on agentic reliability benchmarks — but the API pricing is premium-tier, and that’s already rubbing people the wrong way.

For the wider model landscape, see Latest AI Models Compared (2026). For the open-weights competitor that launched the same week, see our MiniMax M3 breakdown. For how these coding models compare head-to-head, see DeepSeek V4 vs ChatGPT vs Claude.

Quick Specs

SpecDetail
Release dateJune 12, 2026
Total parameters1 trillion
Active per token32 billion
ArchitectureMixture of Experts (MoE), 384 experts, 8 active + 1 shared
Context window256,000 tokens
Max output32,768 tokens (default)
VisionMoonViT 400M (images + video)
LicenseModified MIT (open weights)
API pricing$0.95/M in, $0.19/M cached, $4.00/M out
Free accesskimi.com/code (rate limited)

What Is Kimi K2.7 Code

Moonshot AI launched in Beijing in 2023 and ships products under the “Kimi” brand at kimi.com. The lineage goes K2.0, K2.5, K2.6, and now K2.7 Code — three major releases in roughly eighteen months, each shrinking the gap between open-source and frontier proprietary models.

K2.6 was a general-purpose model that scored respectably on coding. K2.7 Code takes that same architecture and re-focuses it on one thing: writing, debugging, and reasoning about code across sessions spanning dozens of turns.

The model uses a Mixture of Experts design with 384 total experts. On each token, it routes through 8 of them plus 1 shared expert. Only 32 billion parameters fire per token out of the full trillion on disk. That extreme sparsity keeps inference practical without a rack of GPUs.

It’s also got MoonViT, a 400-million-parameter vision encoder, so you can paste screenshots, diagrams, or video frames into your coding session. Most coding-specific models skip multimodal entirely. K2.7 doesn’t.

K2.7 forces “preserve_thinking” mode — you can’t turn it off. The model always produces a reasoning trace that carries context across turns. For one-shot questions, that’s annoying overhead. For coding sessions that last an hour and touch six files, it’s the thing that keeps the model from going off the rails on turn twelve.

Key Features

Extreme Sparsity: 1T Parameters, 32B Active

The 1-trillion headline is eye-catching, but the 32-billion-active number is what you’d experience. K2.7 uses Multi-head Latent Attention (same family DeepSeek popularized) with 61 layers, one dense and the rest MoE-based.

In practice, the model stores enormous knowledge but taps only a small slice per request. Inference is faster than you’d expect for 1T total parameters, and expert routing specializes during training — some experts handle Python patterns, others shell scripting, others SQL.

The risk: MoE models sometimes route tokens to the wrong expert and produce nonsense. K2.6 had occasional routing failures in long sessions. K2.7 seems tighter here, though we haven’t stress-tested beyond a few hours of use.

preserve_thinking: Reasoning That Persists

Most reasoning models dump their thought trace after every turn and start fresh. That’s fine for one-shot questions. It’s terrible for debugging a multi-file refactor where turn five depends on context from turn two.

K2.7’s preserve_thinking keeps the reasoning chain alive. The model revisits earlier decisions, maintains assumptions it made three turns ago, and doesn’t lose track of why it chose one approach over another.

Moonshot claims a 30% reduction in thinking-token usage compared to K2.6. The model simply doesn’t need to re-derive everything from scratch. Less overthinking, fewer wasted tokens, faster responses at the same quality level.

The catch: you’re locked into thinking mode and a temperature of 1.0 with top_p at 0.95. If you want deterministic, low-temperature output for code generation, this model won’t give it to you. Moonshot’s bet is that the reasoning chain compensates for the stochastic sampling. We’ll need more testing to know if that bet pays off.

Multimodal: MoonViT

MoonViT processes images and video at 400M parameters. It’s not huge — GPT-5.5 uses encoders several times larger — but it handles error screenshots, architecture diagrams, UI mockups, and short screen recordings.

In limited testing, vision coding was functional, not magical. Paste a React component screenshot and ask for CSS: it’s fine. Show a complex system architecture diagram for critique: it sometimes misses connections. The capability is real, just don’t expect it to replace a human reading a design doc.

Coding Specialization, Not General Purpose

K2.7 is not a general-purpose chat model. The jumps are real: +21.8% on Kimi Code Bench V2 over K2.6, +11% on Program Bench, +31.5% on MLS Bench Lite. But the specialization narrows its range. K2.6 was broadly capable across writing, analysis, and creative tasks. K2.7 Code is optimized for function signatures and test suites, not sonnets.

Benchmarks

Moonshot published six benchmarks comparing K2.7 Code against K2.6, GPT-5.5 Codex, and Claude Opus 4.8. Here’s the full picture:

Coding Benchmarks

BenchmarkK2.6K2.7 CodeGPT-5.5 CodexClaude Opus 4.8
Kimi Code Bench V250.962.069.067.4
Program Bench48.353.669.163.8
MLS Bench Lite26.735.135.542.8

Agentic Benchmarks

BenchmarkK2.6K2.7 CodeGPT-5.5 CodexClaude Opus 4.8
Kimi Claw 24/742.946.952.850.4
MCP Atlas69.476.079.481.3
MCP Mark Verified72.881.192.976.4

What These Numbers Mean

The geometric mean across all six: K2.7 Code at 56.3%, up from K2.6’s 48.2%. GPT-5.5 Codex leads at 62.7%, Opus 4.8 at 62.2%. That’s a 16.8% relative improvement over K2.6 — a real generational leap.

On raw coding, GPT-5.5 is still the king. It leads K2.7 by 7 points on Kimi Code Bench V2 and 15.5 points on Program Bench. If your workflow is write-prompt-get-code-done, GPT-5.5 is faster and more accurate.

But the agentic benchmarks flip the story. K2.7 beats Opus 4.8 on MCP Mark Verified (81.1 vs 76.4), which measures reliable multi-step task execution. It nearly ties GPT-5.5 on MLS Bench Lite (35.1 vs 35.5). The preserve_thinking mechanism pays off in long-running, multi-turn scenarios.

The MCP Atlas gap remains: K2.7 at 76.0 vs Opus at 81.3 vs GPT-5.5 at 79.4. Frontier models still lead on tool use and multi-step orchestration.

Kimi K2.7 Code benchmark comparison chart showing coding and agentic scores vs competitors

Kimi K2.7 Code interactive coding interface at kimi.com/code. Screenshot captured June 13, 2026. UI and features may change.

What We Actually Tested

We spent about four hours with K2.7 Code on June 13 — the day after release — on: refactoring a Python data pipeline, writing React components from specs, debugging a Go race condition, and generating a moderately complex SQL reporting query. It handled all four competently. The Go race condition fix (mutex-based) worked on first try. React components were correct but verbose.

What we didn’t test: multi-file refactors, IDE plugins, video input, large-codebase throughput, or 256K-context scenarios.

Running K2.7 Code locally

K2.7 Code is a 1-trillion-parameter MoE model. That number sounds impossible to run locally. But only 32 billion parameters fire per token. The extreme sparsity changes the local hosting math entirely, and the open-source community has already built tooling around it.

The sparsity advantage

With only 32B active parameters per forward pass, K2.7 Code is closer in inference cost to a 32B dense model than a 1T dense model. The full 1T weights sit on disk (or in GPU memory if you have it), but each token only touches a small slice. This makes CPU offloading viable in a way it is not for dense 400B+ models where every parameter fires on every token.

In practice, you can run K2.7 Code on a single high-end workstation with 2-4 GPUs and CPU offloading for the inactive experts. It will not be fast, but it works. Four RTX 4090s (24GB each) with KTransformers and INT4 quantization can serve K2.7 at roughly 6-10 tokens per second at 32K context.

Quantization and hardware guide

Moonshot ships INT4-quantized weights natively on Hugging Face. This is not a community afterthought. K2.7 was trained with quantization in mind, and the INT4 weights come directly from Moonshot’s training pipeline.

SetupHardwareTok/sec at 32KVRAM neededQuality hit
BF16 (full)8x H100-80GB or 4x B20025-40 tok/s~2TBReference
INT84x H100 or 8x A100-40GB15-25 tok/s~1TBNear-lossless
INT4 (vLLM)4x A100-80GB or 8x A600012-20 tok/s~500GBMinimal on coding
INT4 (KTransformers)4x RTX 4090 24GB6-10 tok/s~96GB + CPU RAMAcceptable for debugging
INT4 (CPU offload)2x RTX 4090 + 128GB RAM2-4 tok/s~48GB VRAM + 128GB RAMSlow but functional

The KTransformers project (github.com/kvcache-ai/ktransformers) has first-class support for K2.7 Code. It is the recommended path for local deployment if you do not have a data center GPU cluster. Their INT4 kernel is optimized specifically for K2.7’s MoE routing pattern.

vLLM setup

pip install vllm>=0.9.0
python -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2.7-Code \
  --dtype auto \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 4 \
  --quantization fp8

K2.7 Code uses standard MLA attention (same family as DeepSeek V3), so it does not need trust_remote_code. The INT4 checkpoint loads directly without custom kernel compilation. That alone makes local deployment much smoother than competing models with novel attention mechanisms.

For a comparison with another model that has mature local deployment tooling, see our Gemma 4 12B local setup guide. The patterns are similar even though the scale is different.

KTransforms for budget hardware

If you have a desktop with 2-4 consumer GPUs, KTransformers is the practical choice:

git clone https://github.com/kvcache-ai/ktransformers
cd ktransformers
pip install -e .

python -m ktransformers.server \
  --model moonshotai/Kimi-K2.7-Code \
  --quantization int4 \
  --cpu-offload-gb 64 \
  --max-context 32768

This offloads inactive MoE experts to system RAM and only keeps active experts on GPU. At INT4, the 32B active parameters need about 16GB of GPU memory. The remaining 968B-parameter dormant experts sit in CPU RAM, paged in as needed when expert routing changes.

Expect 2-4 tok/s with this setup. It is slow for interactive chat but viable for overnight batch jobs. The real question is whether the API at $4/M output tokens is actually more expensive than the electricity and hardware depreciation of running a multi-GPU rig around the clock.

Local vs API: the real cost comparison

For one-off coding sessions: use the API. The free kimi.com web chat handles evaluation. Paid API at $5-$10 per complex session is cheaper than even one day of GPU cloud rental.

For CI pipelines doing hundreds of calls per day: local hosting wins if you already own the hardware. A 4x A100 server amortized over a year costs about $15-20 per hour. If you can batch requests and keep the GPUs saturated, local INT4 inference beats the $4/M output token API rate on volume.

For sensitive codebases: local. No question. The Modified MIT license means you can run K2.7 wherever you want. No data leaves your network. That alone justifies the deployment effort for any team working on proprietary code.

Pricing

K2.7 Code’s API costs $0.95/M input, $0.19/M cached input, and $4.00/M output. Free web chat at kimi.com/code lets you test it, with rate limits.

This is premium-tier pricing. DeepSeek V4 Pro is roughly 10-15x cheaper on input, 8-12x cheaper on output. MiMo V2.5 Pro is significantly cheaper on cached input.

The sore spot is cached inputs. At $0.19/M, K2.7’s cache reads are about 53 times more expensive than DeepSeek’s. For developers relying on long system prompts or context caching — most production users — that stings.

A user on Hacker News (mdasen) put it bluntly: “The real savings with MiMo/DeepSeek’s price cut is the cached inputs. K2.7 at 53x more expensive on cache reads eats the value proposition for production workloads.”

That said, one developer (pizlonator) rebased a 177KB OpenSSL patch from 3.3.1 to 3.5.7 using K2.7 for an estimated $5-$10. For a task of that complexity, that’s cheap.

The story is mixed. For one-off complex sessions, K2.7 is affordable. For CI workflows or apps calling the model hundreds of times daily, the cached input pricing is hard to justify.

Community Reaction

The Hacker News thread hit 427 points and 225 comments within hours. The mood was mostly positive, with sharp elbows on pricing.

Bnjoroge observed: “Feels like kimi are positioning themselves as the premium open source models.” Accurate. Moonshot isn’t racing DeepSeek to the bottom — they’re betting preserve_thinking and coding specialization justify a higher price tag.

The most grounded praise came from real workloads. goldenarm verified the geometric mean against published numbers. pizlonator rebased a 177KB OpenSSL patch from 3.3.1 to 3.5.7 for under $10. Others reported successful multi-file refactors via Claude Code and Cursor.

Some grumbled about the locked temperature and thinking-token overhead. A few developers wanted deterministic output. Most conceded the reasoning chain is central to K2.7’s identity — disable it and you’ve just got K2.6.

yanis_t voiced a frustration that applies broadly: “its ability to not fuck up my projects is absent.” Fair. No model at this tier is reliable enough for unattended production use. The question is which one breaks things in the most recoverable way.

vs Alternatives

Kimi K2.7 Code vs DeepSeek V4 Pro

DeepSeek V4 Pro is substantially cheaper across the board, especially on cached inputs. It scores higher on MCP Atlas and has a larger ecosystem of deployment tools and community adaptations.

K2.7 Code’s advantages: multimodal support (DeepSeek is text-only), preserve_thinking for multi-turn coherence, and slightly higher scores on MCP Mark Verified. If you need vision input or long coding sessions, K2.7 has a real edge. If you are price-sensitive and text-only, DeepSeek is the pragmatic choice. For the full comparison across coding models, see our DeepSeek V4 vs ChatGPT vs Claude breakdown.

Kimi K2.7 Code vs Claude Opus 4.8

Opus 4.8 is the overall stronger coder on raw benchmarks: Kimi Code Bench V2 (67.4 vs 62.0), Program Bench (63.8 vs 53.6), MLS Bench Lite (42.8 vs 35.1). It produces cleaner, more idiomatic code on the first try.

K2.7 wins on MCP Mark Verified (81.1 vs 76.4), the metric for reliable task execution. And it’s open-source with a permissive license — Opus is proprietary and API-only. If you need weights you can self-host or fine-tune, K2.7 is the only game between these two.

Kimi K2.7 Code vs GPT-5.5 Codex

GPT-5.5 is the coding benchmark leader and probably the best coding model as of June 2026. It beats K2.7 on every raw coding metric and nearly every agentic metric. The only near-tie is MLS Bench Lite (35.5 vs 35.1).

K2.7’s counter is openness. GPT-5.5 is proprietary, expensive, and controlled entirely by OpenAI. K2.7’s weights are on Hugging Face under Modified MIT. For organizations that can’t or won’t depend on a single vendor’s API, that matters.

Hugging Face model card for Kimi K2.7 Code showing weights, license, and community stats

Hugging Face repository for Kimi K2.7 Code with Modified MIT license and quantized weights. Screenshot from huggingface.co/moonshotai/Kimi-K2.7-Code, captured June 13, 2026. Download counts change daily.

Who Should Use Kimi K2.7 Code

Use it if: you’re a developer who works in long coding sessions, jumps between files, and needs the model to remember context from earlier turns. The preserve_thinking mechanism was built for exactly your workflow. You should also use it if you need an open-source coding model with multimodal input — K2.7 is the strongest open option in this category right now.

Watch it if: you’re price-sensitive or building production pipelines that make hundreds of API calls per day. DeepSeek V4 Pro and MiMo V2.5 Pro offer similar quality at a fraction of the cost, especially on cached inputs. The “6x High-Speed Mode” Moonshot teased in the launch might change the value equation, but it’s not available yet.

Skip it if: you need the absolute best coding accuracy on one-shot prompts and don’t care about open-source. GPT-5.5 Codex and Claude Opus 4.8 are stronger on raw benchmarks and produce cleaner code with less iteration. You’re paying a premium for those models, but the time saved on revisions adds up fast.

Skip it if: you want a general-purpose model. K2.7 Code is narrowly optimized for programming. K2.6 or the base K2.7 (if Moonshot releases a general version) would serve broader use cases better.

Verdict

Kimi K2.7 Code earns a seat at the table with GPT-5.5 Codex and Claude Opus 4.8. It doesn’t beat them on raw coding benchmarks, and it won’t be your first choice for quick one-shot generation.

But preserve_thinking solves a real problem. Multi-turn coding sessions fall apart when the model forgets constraints or loses the plot on turn eight. K2.7 keeps its thread better than anything else in open-source, and that matters more for actual development work than a five-point gap on a synthetic benchmark.

The pricing is the friction point. You can do serious work for under ten dollars, but cached input is 53 times what DeepSeek charges. Moonshot is pricing K2.7 above the open-source commodity tier and below the proprietary frontier. Whether that holds depends on how fast DeepSeek and MiMo close the quality gap.

The “6x High-Speed Mode” teaser is worth watching. If Moonshot delivers faster, cheaper inference without sacrificing the reasoning chain, K2.7 becomes much more attractive for production use. Until then, it’s a great model for focused coding sessions a few times per day, and expensive for anything continuous.

For the open-weights alternative that launched alongside K2.7 with comparable coding benchmarks at a lower price, see our MiniMax M3 breakdown.

Kimi K2.7 Code Hugging Face page showing model description and download options

Kimi K2.7 Code web interface at kimi.com/code. The preserve_thinking reasoning chain is visible in multi-turn coding sessions. Screenshot captured June 13, 2026.

Changelog

  • 2026-06-13: First publish. Kimi K2.7 Code specs, benchmarks, pricing, and community reaction as of launch day.

Frequently asked

8 questions
What is Kimi K2.7 Code?

Kimi K2.7 Code is an open-source coding AI model from Moonshot AI, released June 12, 2026. It has 1 trillion total parameters but activates only 32 billion per token using Mixture of Experts architecture. It supports 256K context, image and video input via MoonViT, and forces a preserve_thinking mode that retains reasoning across multi-turn coding sessions. Weights are available under a Modified MIT license on Hugging Face.

How good is Kimi K2.7 Code at programming?

It scores 62.0 on Kimi Code Bench V2 and 53.6 on Program Bench -- solid, but it trails GPT-5.5 Codex (69.0, 69.1) and Claude Opus 4.8 (67.4, 63.8) on raw coding. However, it beats Opus 4.8 on MCP Mark Verified (81.1 vs 76.4) and nearly matches GPT-5.5 on MLS Bench Lite (35.1 vs 35.5), so it's genuinely competitive for agentic coding tasks.

How much does Kimi K2.7 Code cost?

API pricing is $0.95 per million input tokens, $0.19/M for cached input, and $4.00/M for output. This is more expensive than DeepSeek V4 Pro and MiMo V2.5 Pro, especially on cached inputs. The free tier at kimi.com/code lets you test it without spending anything, but with rate limits.

Does Kimi K2.7 Code support images and video?

Yes. It includes MoonViT, a 400-million-parameter vision encoder that handles image and video input alongside text. This makes it multimodal, unlike most coding-specialist models in this weight class.

Is Kimi K2.7 Code open source?

Yes, weights are released under a Modified MIT license on Hugging Face at huggingface.co/moonshotai/Kimi-K2.7-Code. It works with vLLM, SGLang, KTransformers, and integrates with Cursor, VS Code, Claude Code, Roo Code, and Cline.

How does K2.7 Code compare to DeepSeek V4 Pro?

DeepSeek V4 Pro is significantly cheaper, especially on cached inputs (53x cheaper than K2.7 on that metric). DeepSeek also scores higher on most agentic benchmarks. K2.7's main advantages are multimodal support and the preserve_thinking mechanism, which DeepSeek lacks.

What hardware do I need to run Kimi K2.7 Code locally?

With 1 trillion total parameters, you need serious hardware even though only 32B are active per token. The full model requires multiple GPUs. Most users will access it via the API or kimi.com chat. Projects like KTransformers and vLLM support quantized inference, but self-hosting is not cheap.

Can I use Kimi K2.7 Code inside my IDE?

Yes. It integrates with Cursor, VS Code, Claude Code, Roo Code, and Cline. The forced preserve_thinking mode means it maintains coherent reasoning across multi-turn coding sessions inside your IDE, which is uncommon among coding models.

More in Models

View all