AI Tools Radar
中文
GLM-5.2 benchmark comparison chart showing scores against Claude Opus 4.8, GPT-5.5, and DeepSeek-V4-Pro

Models

GLM-5.2: Open-Source Frontier Model with 1M Context, Benchmarks, and Local Setup (2026)

GLM-5.2 from Zhipu AI is a 744B open-weight model under MIT license. Benchmarks, pricing, local setup with vLLM and llama.cpp, and how it compares to Claude Opus 4.8 and GPT-5.5.

AI Tools Radar Editorial 11 min read

Short answer (June 2026): GLM-5.2 is Zhipu AI’s 744B open-weight model under MIT license with 1M-token context and 384 Mixture of Experts routing ~40B parameters per token. It tops AIME 2026 at 99.2, trails Claude Opus 4.8 on coding, and costs roughly 10x less per token than either Opus or GPT-5.5. Text-only at launch. Weights on HuggingFace. The closest an open model has come to frontier proprietary performance.

For the full June 2026 model landscape, see Latest AI Models Compared (2026). For the coding model comparison it slots into, see DeepSeek V4 vs ChatGPT vs Claude. For all models in this category, see our Models hub.

Last updated: June 17, 2026. Live on aitoolsradar.org.

Quick Specs

SpecGLM-5.2
Release dateJune 13, 2026
DeveloperZhipu AI (Z.ai), formerly THUDM / Tsinghua University
Total parameters744 billion
Active per token~40 billion (MoE, 384 experts)
Context window1,000,000 tokens
Max output131,072 tokens
ModalityText-only (no vision at launch)
LicenseMIT (fully open weights)
AttentionIndexShare sparse attention (2.9x FLOP reduction at 1M context)
Speculative decodingImproved Multi-Token Prediction (MTP)
API pricing$1.40/M in, $0.26/M cached, $4.40/M out
HuggingFacezai-org/GLM-5.2, zai-org/GLM-5.2-FP8
GGUF quantsunsloth/GLM-5.2

What Is GLM-5.2

Zhipu AI shipped GLM-5.2 on June 13, 2026, exactly 24 hours after the US government ordered Anthropic’s Fable 5 offline. The timing drew attention. So did the specs: 744B parameters total, ~40B active per token, MIT license, 1M context, no usage restrictions.

The model family started at Tsinghua University and grew through Zhipu AI, the commercial spinoff. The lineage runs GLM-5 (February 11), GLM-5-Turbo (March 15), GLM-5.1 (April 7), and now GLM-5.2. Each version widened the gap with the previous one, but 5.2 is the first to land in genuine frontier territory on math and science benchmarks.

GLM-5.2 uses a Mixture of Experts architecture with 384 experts. On each token, it activates roughly 40 billion parameters. That’s a higher activation ratio than Kimi K2.7 (32B active out of 1T) and lower total parameters, which changes the compute profile: fewer experts to store, more parameters per expert, potentially better per-expert specialization.

The 1M context window relies on IndexShare sparse attention, which Zhipu claims reduces per-token FLOPs by 2.9x at full context length. That’s a big claim. Whether it holds on real-world retrieval tasks (not synthetic needle-in-haystack) remains to be seen. The 131,072 max output is also large enough for complete file generation in a single pass.

One notable absence: multimodal input. GLM-5.2 is text-only. Earlier models in the family (GLM-5V-Turbo) handled images. Zhipu stripped that out to focus the compute budget on text and code quality. If you need vision, you’ll have to wait for a future variant or use a different model.

Z.ai official blog page announcing GLM-5.2 Built for Long-Horizon Tasks on z.ai

Z.ai official announcement for GLM-5.2. Screenshot from z.ai/blog/glm-5.2, captured 2026-06-17. Page content may change.

Benchmarks

Zhipu didn’t publish benchmarks at launch. They appeared three days later, on June 16. That delay drew criticism. The numbers themselves, once published, told an interesting story.

BenchmarkGLM-5.2Claude Opus 4.8GPT-5.5DeepSeek-V4-ProQwen3.7-Max
AIME 202699.295.798.394.697.0
IMOAnswerBench91.083.5n/a89.890.0
GPQA-Diamond91.293.693.690.190.0
HLE40.549.841.437.741.4
SWE-bench Pro62.169.258.655.460.6
Terminal-Bench 2.181.085.084.064.075.0
NL2Repo48.969.750.735.547.2
DeepSWE46.258.070.08.018.0

What the Numbers Mean

Math and science are GLM-5.2’s strongest suit. The 99.2 on AIME 2026 is the highest published score from any model, open or closed. IMOAnswerBench at 91.0 also leads. GPQA-Diamond (graduate-level science) lands at 91.2, just behind Opus 4.8 and GPT-5.5 at 93.6.

Coding is competitive but not leading. SWE-bench Pro at 62.1 trails Opus 4.8 by 7.1 points. Terminal-Bench 2.1 at 81.0 is 4 points behind Opus and 3 behind GPT-5.5. NL2Repo at 48.9 is 20.8 points behind Opus. DeepSWE at 46.2 trails GPT-5.5 by nearly 24 points.

The honest read: GLM-5.2 is the best open-weight model available right now on math. On coding, it’s good but clearly behind Claude Opus 4.8 and (on some benchmarks) GPT-5.5. The gap narrows on shorter, more structured tasks and widens on open-ended software engineering.

One caveat: these are vendor-reported scores. Independent verification takes weeks. Treat them as directional, not final. GLM-5.2 beat DeepSeek-V4-Pro and Qwen3.7-Max on every benchmark listed here, which makes it the open-weight leader. Whether it can close the gap with Opus 4.8 on real coding workloads is a separate question.

GLM-5.2 full benchmark table from the official Z.ai blog showing scores across reasoning, coding, and agentic tasks

GLM-5.2 official benchmark results published June 16, 2026. Screenshot from z.ai/blog/glm-5.2, captured 2026-06-17. Scores are vendor-reported.

How People Are Using It

GLM-5.2 dropped into a crowded field but found real traction quickly. The Hacker News thread hit 616 upvotes with 340+ comments. Zhipu’s stock surged 48% intraday on the Hong Kong Stock Exchange before closing at +32.8%.

Long-horizon agentic coding. The 1M context window plus 131K max output makes GLM-5.2 attractive for multi-hour software engineering sessions. Developers report loading entire repositories into context and running multi-step refactors without losing coherence.

Codebase analysis. At 1M tokens, you can feed a mid-size repo into GLM-5.2 and ask structural questions. The IndexShare sparse attention is supposed to keep retrieval quality high even at extreme context lengths.

Tool-use workflows. GLM-5.2 supports function calling and MCP. It works with Claude Code, Cline, OpenCode, Roo Code, Goose, Crush, OpenClaw, and Kilo Code through OpenAI-compatible API endpoints. You point your coding tool at a vLLM server or an API provider, and it works.

Complex debugging. The sustained coherence over long contexts helps for debugging sessions where the model needs to track multiple files, error traces, and test outputs simultaneously.

Community criticism focused on two things. First, the three-day delay between launch and benchmarks felt like Zhipu was gauging reception before publishing numbers. Second, text-only at launch disappointed teams that needed vision input. The weights themselves were also delayed slightly after the announcement, though they’re now available on HuggingFace.

Pricing

GLM-5.2’s pricing is its clearest advantage over closed models.

ModelInput (per 1M tokens)Cached Input (per 1M)Output (per 1M tokens)
GLM-5.2$1.40$0.26$4.40
Claude Opus 4.8~$15.00varies~$75.00
GPT-5.5~$10.00varies~$30.00
DeepSeek-V4-Procheapercheapercheaper
GLM-4.5-FlashFREEFREEFREE
GLM-4.7-FlashFREEFREEFREE

The gap is enormous. GLM-5.2’s output costs $4.40/M versus Opus 4.8 at roughly $75/M. That’s about 17x cheaper. Input is roughly 10x cheaper than Opus. Even against GPT-5.5, it’s 7x cheaper on output.

For teams running continuous agent loops or batch processing, this changes the math entirely. A task that costs $75 on Opus costs $4.40 on GLM-5.2. Even if you need 2-3x more iterations to match Opus quality, you’re still spending far less.

GLM-5.2 is available on OpenRouter, GMI Cloud, Novita, and Cloudflare Workers AI. Zhipu also offers free access to the older Flash models (GLM-4.5-Flash, GLM-4.7-Flash), which are decent for lighter tasks. See our OpenRouter Free Models guide for more on free-tier options.

How to Run GLM-5.2 Locally

This is a 744B model. “Running locally” means different things depending on your budget.

Hardware Requirements

ConfigurationVRAM / RAM NeededHardware ExampleExpected Speed
BF16 (full precision)~1,500-1,700 GB VRAM16+ H100 80GBFull speed
FP8~860 GB8x H200 or 8x H100 80GBNear full speed
FP8 + 1M context~1,440 GB8x B200Full speed at max context
Q4_K_M GGUF~476 GBMulti-GPU clusterModerate
IQ2_XXS 2-bit GGUF~241 GBM4 Ultra Mac Studio 256GB3-9 tok/s
1-bit GGUF~176 GB system RAMLarge-RAM workstationVery slow

The practical floor for self-hosting is either 8x H200 GPUs (about $250,000 in hardware) or a maxed-out M4 Ultra Mac Studio ($10,000-$15,000) running 2-bit quants at 3-9 tokens per second. Neither is casual.

vLLM Setup (FP8, 8x H200)

This is the recommended production path. You’ll need vLLM 0.23.0+ and Transformers 5.4.0+.

uv pip install "vllm==0.23.0" --torch-backend=auto
uv pip install "transformers>=5.4.0"

vllm serve zai-org/GLM-5.2-FP8 \
  --kv-cache-dtype fp8_e4m3 \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 5 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5.2-fp8

The --speculative-config.method mtp flag enables Multi-Token Prediction speculative decoding, which Zhipu designed specifically for this model. It predicts 5 tokens ahead and verifies in parallel, improving throughput without quality loss.

The tool-call and reasoning parsers (glm47, glm45) are model-specific. Don’t swap them for generic parsers or function calling will break.

llama.cpp Setup (2-bit GGUF, Consumer Hardware)

For the M4 Ultra crowd or multi-GPU hobbyist setups:

./llama.cpp/build/bin/llama-server \
  --model ./models/GLM-5-UD-IQ2_XXS.gguf \
  --ctx-size 16384 \
  --host 0.0.0.0 --port 8080 \
  --flash-attn auto

Note the context size: 16,384 tokens, not 1M. At 2-bit quantization with limited RAM, you won’t get anywhere near the full context window. The model still works well for shorter interactions, but the million-token headline feature is effectively unavailable at this quantization level.

GGUF quants are available from Unsloth at unsloth/GLM-5.2 on HuggingFace.

HuggingFace model card page for zai-org/glm-4-9b-chat showing model details and downloads on huggingface.co

GLM model card on HuggingFace (zai-org). The GLM-5.2 and GLM-5.2-FP8 weights are also hosted here. Screenshot from huggingface.co, captured 2026-06-17.

Smaller GLM Alternatives for Consumer Hardware

Not everyone has a rack of H200s. Zhipu maintains smaller models that run on hardware you can actually buy.

ModelParametersLicenseRuns OnComparable To
GLM-4-32B-041432BApache 2.0Consumer GPUs (24GB VRAM)GPT-4o class
GLM-Z1-32B-041432B (reasoning)MITConsumer GPUs (24GB VRAM)DeepSeek-R1 class
GLM-4-9B9BOpenEdge devices, laptopsLightweight tasks
GLM-Z1-9B9B (reasoning)OpenEdge devices, laptopsLightweight reasoning

The 32B models are genuinely useful. GLM-4-32B-0414 rivals GPT-4o on general tasks and fits on a single RTX 4090 at Q4 quantization. GLM-Z1-32B-0414 is a reasoning variant that competes with DeepSeek-R1 on math problems. Both run locally without cloud dependencies.

The 9B models are for edge deployment or laptop inference. They won’t match frontier quality, but they handle code completion, simple Q&A, and structured extraction at reasonable speeds on modest hardware.

Who Should Use GLM-5.2

Use it if: you want frontier-adjacent quality at open-source prices. The math benchmarks are best-in-class. The API pricing is 10-17x cheaper than Opus or GPT-5.5. And MIT license means no restrictions on commercial use, self-hosting, or fine-tuning. If you’re building production systems where cost per token matters and you can tolerate a small quality gap on coding tasks, GLM-5.2 is the strongest open option right now.

Watch it if: you care about coding benchmarks specifically. GLM-5.2 trails Opus 4.8 by 7 points on SWE-bench Pro and 21 points on NL2Repo. Those gaps are significant for agentic coding workflows. Zhipu may close them with fine-tuned coding variants, but that hasn’t happened yet. Also watch if you need multimodal: text-only is a real limitation for teams that pass screenshots or diagrams to their models.

Skip it if: you need the absolute best coding model and budget isn’t a constraint. Claude Opus 4.8 dominates on SWE-bench Pro, Terminal-Bench, NL2Repo, and HLE. GPT-5.5 leads on DeepSWE. If your workflow is shipping production code and you bill clients enough to cover Opus pricing, the quality difference justifies the cost.

Skip it if: you want to run it on consumer hardware at full quality. Even 2-bit GGUF quants need 241 GB of RAM and cap out at 3-9 tokens per second with severely reduced context. The smaller GLM-4-32B models are better fits for local development.

The Bigger Picture

GLM-5.2 lands at an interesting moment. Chinese open-weight models (DeepSeek, Qwen, Kimi, MiniMax, and now GLM) are converging on frontier-quality performance while undercutting US proprietary models on price by an order of magnitude. The MIT license, the 1M context, the $1.40 input pricing: these aren’t accidental. They’re a strategy to pull developer mindshare away from Opus and GPT-5.5.

The question isn’t whether GLM-5.2 is good. It is. The question is whether the coding gap matters for your specific workload. For math, science, and general reasoning, GLM-5.2 matches or beats everything except Opus on HLE. For multi-step software engineering, Opus still leads by a meaningful margin. And for simple tasks, the free Flash models or DeepSeek-V4-Pro offer similar value at even lower cost.

The three-day delay between release and benchmarks was a misstep. In a market where trust is currency, launching without numbers and then publishing them later feels like Zhipu was waiting to see how the model was received before committing to specific claims. The benchmarks are strong enough that the delay was unnecessary.

For the open-weights competitor that launched the same week, see our Kimi K2.7 Code breakdown. For a broader look at how these models compare on coding tasks, see DeepSeek V4 vs ChatGPT vs Claude.

Changelog

  • 2026-06-17: First publish. GLM-5.2 specs, benchmarks (published June 16), pricing, local setup, and community reaction as of four days post-launch.

Frequently asked

7 questions
Is GLM-5.2 really open source?

Yes. Zhipu released the weights under MIT license with no regional restrictions. You can download from HuggingFace (zai-org/GLM-5.2) and self-host. The training code is not included, only inference weights.

How much VRAM do I need to run GLM-5.2 locally?

At FP8 precision you need about 860 GB, which means 8x H200 GPUs. For consumer hardware, 2-bit GGUF quants need around 241 GB of system RAM (fits an M4 Ultra Mac Studio with 256 GB). Expect 3 to 9 tokens per second at that quantization.

How does GLM-5.2 compare to Claude Opus 4.8?

GLM-5.2 trails Opus 4.8 by 1 to 13 percent on coding benchmarks like SWE-bench Pro and Terminal-Bench. But it beats Opus 4.8 on math (99.2 vs 95.7 on AIME 2026) and costs about 10x less per token. It is the closest open-weight competitor to Opus 4.8.

What is the GLM-5.2 API pricing?

Input costs $1.40 per million tokens, cached input $0.26, and output $4.40. That is roughly 5 to 10x cheaper than Claude Opus 4.8 or GPT-5.5. Available through Z.ai directly, OpenRouter, GMI Cloud, and Cloudflare Workers AI.

Can I use GLM-5.2 with coding tools like Claude Code or Cursor?

GLM-5.2 works with Claude Code, Cline, OpenCode, Roo Code, Goose, and several other coding agents via OpenAI-compatible API endpoints. You point the tool at your vLLM server or an API provider like OpenRouter.

Does GLM-5.2 support images or multimodal input?

No. GLM-5.2 is text-only at launch. Earlier models like GLM-5V-Turbo support vision, but the 5.2 release focuses on text and code tasks only.

What are the smaller GLM models I can run on a regular GPU?

GLM-4-32B-0414 (Apache 2.0) and GLM-Z1-32B-0414 (MIT) both run on consumer GPUs with 24 GB VRAM. The 9B variants (GLM-4-9B, GLM-Z1-9B) work on even smaller hardware. These are solid mid-range options if the full 744B model is out of reach.

More in Models

View all