Models
GLM-5.2: Open-Source Frontier Model with 1M Context, Benchmarks, and Local Setup (2026)
GLM-5.2 from Zhipu AI is a 744B open-weight model under MIT license. Benchmarks, pricing, local setup with vLLM and llama.cpp, and how it compares to Claude Opus 4.8 and GPT-5.5.
Short answer (June 2026): GLM-5.2 is Zhipu AI’s 744B open-weight model under MIT license with 1M-token context and 384 Mixture of Experts routing ~40B parameters per token. It tops AIME 2026 at 99.2, trails Claude Opus 4.8 on coding, and costs roughly 10x less per token than either Opus or GPT-5.5. Text-only at launch. Weights on HuggingFace. The closest an open model has come to frontier proprietary performance.
For the full June 2026 model landscape, see Latest AI Models Compared (2026). For the coding model comparison it slots into, see DeepSeek V4 vs ChatGPT vs Claude. For all models in this category, see our Models hub.
Last updated: June 17, 2026. Live on aitoolsradar.org.
Quick Specs
| Spec | GLM-5.2 |
|---|---|
| Release date | June 13, 2026 |
| Developer | Zhipu AI (Z.ai), formerly THUDM / Tsinghua University |
| Total parameters | 744 billion |
| Active per token | ~40 billion (MoE, 384 experts) |
| Context window | 1,000,000 tokens |
| Max output | 131,072 tokens |
| Modality | Text-only (no vision at launch) |
| License | MIT (fully open weights) |
| Attention | IndexShare sparse attention (2.9x FLOP reduction at 1M context) |
| Speculative decoding | Improved Multi-Token Prediction (MTP) |
| API pricing | $1.40/M in, $0.26/M cached, $4.40/M out |
| HuggingFace | zai-org/GLM-5.2, zai-org/GLM-5.2-FP8 |
| GGUF quants | unsloth/GLM-5.2 |
What Is GLM-5.2
Zhipu AI shipped GLM-5.2 on June 13, 2026, exactly 24 hours after the US government ordered Anthropic’s Fable 5 offline. The timing drew attention. So did the specs: 744B parameters total, ~40B active per token, MIT license, 1M context, no usage restrictions.
The model family started at Tsinghua University and grew through Zhipu AI, the commercial spinoff. The lineage runs GLM-5 (February 11), GLM-5-Turbo (March 15), GLM-5.1 (April 7), and now GLM-5.2. Each version widened the gap with the previous one, but 5.2 is the first to land in genuine frontier territory on math and science benchmarks.
GLM-5.2 uses a Mixture of Experts architecture with 384 experts. On each token, it activates roughly 40 billion parameters. That’s a higher activation ratio than Kimi K2.7 (32B active out of 1T) and lower total parameters, which changes the compute profile: fewer experts to store, more parameters per expert, potentially better per-expert specialization.
The 1M context window relies on IndexShare sparse attention, which Zhipu claims reduces per-token FLOPs by 2.9x at full context length. That’s a big claim. Whether it holds on real-world retrieval tasks (not synthetic needle-in-haystack) remains to be seen. The 131,072 max output is also large enough for complete file generation in a single pass.
One notable absence: multimodal input. GLM-5.2 is text-only. Earlier models in the family (GLM-5V-Turbo) handled images. Zhipu stripped that out to focus the compute budget on text and code quality. If you need vision, you’ll have to wait for a future variant or use a different model.

Benchmarks
Zhipu didn’t publish benchmarks at launch. They appeared three days later, on June 16. That delay drew criticism. The numbers themselves, once published, told an interesting story.
| Benchmark | GLM-5.2 | Claude Opus 4.8 | GPT-5.5 | DeepSeek-V4-Pro | Qwen3.7-Max |
|---|---|---|---|---|---|
| AIME 2026 | 99.2 | 95.7 | 98.3 | 94.6 | 97.0 |
| IMOAnswerBench | 91.0 | 83.5 | n/a | 89.8 | 90.0 |
| GPQA-Diamond | 91.2 | 93.6 | 93.6 | 90.1 | 90.0 |
| HLE | 40.5 | 49.8 | 41.4 | 37.7 | 41.4 |
| SWE-bench Pro | 62.1 | 69.2 | 58.6 | 55.4 | 60.6 |
| Terminal-Bench 2.1 | 81.0 | 85.0 | 84.0 | 64.0 | 75.0 |
| NL2Repo | 48.9 | 69.7 | 50.7 | 35.5 | 47.2 |
| DeepSWE | 46.2 | 58.0 | 70.0 | 8.0 | 18.0 |
What the Numbers Mean
Math and science are GLM-5.2’s strongest suit. The 99.2 on AIME 2026 is the highest published score from any model, open or closed. IMOAnswerBench at 91.0 also leads. GPQA-Diamond (graduate-level science) lands at 91.2, just behind Opus 4.8 and GPT-5.5 at 93.6.
Coding is competitive but not leading. SWE-bench Pro at 62.1 trails Opus 4.8 by 7.1 points. Terminal-Bench 2.1 at 81.0 is 4 points behind Opus and 3 behind GPT-5.5. NL2Repo at 48.9 is 20.8 points behind Opus. DeepSWE at 46.2 trails GPT-5.5 by nearly 24 points.
The honest read: GLM-5.2 is the best open-weight model available right now on math. On coding, it’s good but clearly behind Claude Opus 4.8 and (on some benchmarks) GPT-5.5. The gap narrows on shorter, more structured tasks and widens on open-ended software engineering.
One caveat: these are vendor-reported scores. Independent verification takes weeks. Treat them as directional, not final. GLM-5.2 beat DeepSeek-V4-Pro and Qwen3.7-Max on every benchmark listed here, which makes it the open-weight leader. Whether it can close the gap with Opus 4.8 on real coding workloads is a separate question.

How People Are Using It
GLM-5.2 dropped into a crowded field but found real traction quickly. The Hacker News thread hit 616 upvotes with 340+ comments. Zhipu’s stock surged 48% intraday on the Hong Kong Stock Exchange before closing at +32.8%.
Long-horizon agentic coding. The 1M context window plus 131K max output makes GLM-5.2 attractive for multi-hour software engineering sessions. Developers report loading entire repositories into context and running multi-step refactors without losing coherence.
Codebase analysis. At 1M tokens, you can feed a mid-size repo into GLM-5.2 and ask structural questions. The IndexShare sparse attention is supposed to keep retrieval quality high even at extreme context lengths.
Tool-use workflows. GLM-5.2 supports function calling and MCP. It works with Claude Code, Cline, OpenCode, Roo Code, Goose, Crush, OpenClaw, and Kilo Code through OpenAI-compatible API endpoints. You point your coding tool at a vLLM server or an API provider, and it works.
Complex debugging. The sustained coherence over long contexts helps for debugging sessions where the model needs to track multiple files, error traces, and test outputs simultaneously.
Community criticism focused on two things. First, the three-day delay between launch and benchmarks felt like Zhipu was gauging reception before publishing numbers. Second, text-only at launch disappointed teams that needed vision input. The weights themselves were also delayed slightly after the announcement, though they’re now available on HuggingFace.
Pricing
GLM-5.2’s pricing is its clearest advantage over closed models.
| Model | Input (per 1M tokens) | Cached Input (per 1M) | Output (per 1M tokens) |
|---|---|---|---|
| GLM-5.2 | $1.40 | $0.26 | $4.40 |
| Claude Opus 4.8 | ~$15.00 | varies | ~$75.00 |
| GPT-5.5 | ~$10.00 | varies | ~$30.00 |
| DeepSeek-V4-Pro | cheaper | cheaper | cheaper |
| GLM-4.5-Flash | FREE | FREE | FREE |
| GLM-4.7-Flash | FREE | FREE | FREE |
The gap is enormous. GLM-5.2’s output costs $4.40/M versus Opus 4.8 at roughly $75/M. That’s about 17x cheaper. Input is roughly 10x cheaper than Opus. Even against GPT-5.5, it’s 7x cheaper on output.
For teams running continuous agent loops or batch processing, this changes the math entirely. A task that costs $75 on Opus costs $4.40 on GLM-5.2. Even if you need 2-3x more iterations to match Opus quality, you’re still spending far less.
GLM-5.2 is available on OpenRouter, GMI Cloud, Novita, and Cloudflare Workers AI. Zhipu also offers free access to the older Flash models (GLM-4.5-Flash, GLM-4.7-Flash), which are decent for lighter tasks. See our OpenRouter Free Models guide for more on free-tier options.
How to Run GLM-5.2 Locally
This is a 744B model. “Running locally” means different things depending on your budget.
Hardware Requirements
| Configuration | VRAM / RAM Needed | Hardware Example | Expected Speed |
|---|---|---|---|
| BF16 (full precision) | ~1,500-1,700 GB VRAM | 16+ H100 80GB | Full speed |
| FP8 | ~860 GB | 8x H200 or 8x H100 80GB | Near full speed |
| FP8 + 1M context | ~1,440 GB | 8x B200 | Full speed at max context |
| Q4_K_M GGUF | ~476 GB | Multi-GPU cluster | Moderate |
| IQ2_XXS 2-bit GGUF | ~241 GB | M4 Ultra Mac Studio 256GB | 3-9 tok/s |
| 1-bit GGUF | ~176 GB system RAM | Large-RAM workstation | Very slow |
The practical floor for self-hosting is either 8x H200 GPUs (about $250,000 in hardware) or a maxed-out M4 Ultra Mac Studio ($10,000-$15,000) running 2-bit quants at 3-9 tokens per second. Neither is casual.
vLLM Setup (FP8, 8x H200)
This is the recommended production path. You’ll need vLLM 0.23.0+ and Transformers 5.4.0+.
uv pip install "vllm==0.23.0" --torch-backend=auto
uv pip install "transformers>=5.4.0"
vllm serve zai-org/GLM-5.2-FP8 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 5 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-5.2-fp8The --speculative-config.method mtp flag enables Multi-Token Prediction speculative decoding, which Zhipu designed specifically for this model. It predicts 5 tokens ahead and verifies in parallel, improving throughput without quality loss.
The tool-call and reasoning parsers (glm47, glm45) are model-specific. Don’t swap them for generic parsers or function calling will break.
llama.cpp Setup (2-bit GGUF, Consumer Hardware)
For the M4 Ultra crowd or multi-GPU hobbyist setups:
./llama.cpp/build/bin/llama-server \
--model ./models/GLM-5-UD-IQ2_XXS.gguf \
--ctx-size 16384 \
--host 0.0.0.0 --port 8080 \
--flash-attn autoNote the context size: 16,384 tokens, not 1M. At 2-bit quantization with limited RAM, you won’t get anywhere near the full context window. The model still works well for shorter interactions, but the million-token headline feature is effectively unavailable at this quantization level.
GGUF quants are available from Unsloth at unsloth/GLM-5.2 on HuggingFace.

Smaller GLM Alternatives for Consumer Hardware
Not everyone has a rack of H200s. Zhipu maintains smaller models that run on hardware you can actually buy.
| Model | Parameters | License | Runs On | Comparable To |
|---|---|---|---|---|
| GLM-4-32B-0414 | 32B | Apache 2.0 | Consumer GPUs (24GB VRAM) | GPT-4o class |
| GLM-Z1-32B-0414 | 32B (reasoning) | MIT | Consumer GPUs (24GB VRAM) | DeepSeek-R1 class |
| GLM-4-9B | 9B | Open | Edge devices, laptops | Lightweight tasks |
| GLM-Z1-9B | 9B (reasoning) | Open | Edge devices, laptops | Lightweight reasoning |
The 32B models are genuinely useful. GLM-4-32B-0414 rivals GPT-4o on general tasks and fits on a single RTX 4090 at Q4 quantization. GLM-Z1-32B-0414 is a reasoning variant that competes with DeepSeek-R1 on math problems. Both run locally without cloud dependencies.
The 9B models are for edge deployment or laptop inference. They won’t match frontier quality, but they handle code completion, simple Q&A, and structured extraction at reasonable speeds on modest hardware.
Who Should Use GLM-5.2
Use it if: you want frontier-adjacent quality at open-source prices. The math benchmarks are best-in-class. The API pricing is 10-17x cheaper than Opus or GPT-5.5. And MIT license means no restrictions on commercial use, self-hosting, or fine-tuning. If you’re building production systems where cost per token matters and you can tolerate a small quality gap on coding tasks, GLM-5.2 is the strongest open option right now.
Watch it if: you care about coding benchmarks specifically. GLM-5.2 trails Opus 4.8 by 7 points on SWE-bench Pro and 21 points on NL2Repo. Those gaps are significant for agentic coding workflows. Zhipu may close them with fine-tuned coding variants, but that hasn’t happened yet. Also watch if you need multimodal: text-only is a real limitation for teams that pass screenshots or diagrams to their models.
Skip it if: you need the absolute best coding model and budget isn’t a constraint. Claude Opus 4.8 dominates on SWE-bench Pro, Terminal-Bench, NL2Repo, and HLE. GPT-5.5 leads on DeepSWE. If your workflow is shipping production code and you bill clients enough to cover Opus pricing, the quality difference justifies the cost.
Skip it if: you want to run it on consumer hardware at full quality. Even 2-bit GGUF quants need 241 GB of RAM and cap out at 3-9 tokens per second with severely reduced context. The smaller GLM-4-32B models are better fits for local development.
The Bigger Picture
GLM-5.2 lands at an interesting moment. Chinese open-weight models (DeepSeek, Qwen, Kimi, MiniMax, and now GLM) are converging on frontier-quality performance while undercutting US proprietary models on price by an order of magnitude. The MIT license, the 1M context, the $1.40 input pricing: these aren’t accidental. They’re a strategy to pull developer mindshare away from Opus and GPT-5.5.
The question isn’t whether GLM-5.2 is good. It is. The question is whether the coding gap matters for your specific workload. For math, science, and general reasoning, GLM-5.2 matches or beats everything except Opus on HLE. For multi-step software engineering, Opus still leads by a meaningful margin. And for simple tasks, the free Flash models or DeepSeek-V4-Pro offer similar value at even lower cost.
The three-day delay between release and benchmarks was a misstep. In a market where trust is currency, launching without numbers and then publishing them later feels like Zhipu was waiting to see how the model was received before committing to specific claims. The benchmarks are strong enough that the delay was unnecessary.
For the open-weights competitor that launched the same week, see our Kimi K2.7 Code breakdown. For a broader look at how these models compare on coding tasks, see DeepSeek V4 vs ChatGPT vs Claude.
Changelog
- 2026-06-17: First publish. GLM-5.2 specs, benchmarks (published June 16), pricing, local setup, and community reaction as of four days post-launch.
Frequently asked
7 questionsIs GLM-5.2 really open source?
Yes. Zhipu released the weights under MIT license with no regional restrictions. You can download from HuggingFace (zai-org/GLM-5.2) and self-host. The training code is not included, only inference weights.
How much VRAM do I need to run GLM-5.2 locally?
At FP8 precision you need about 860 GB, which means 8x H200 GPUs. For consumer hardware, 2-bit GGUF quants need around 241 GB of system RAM (fits an M4 Ultra Mac Studio with 256 GB). Expect 3 to 9 tokens per second at that quantization.
How does GLM-5.2 compare to Claude Opus 4.8?
GLM-5.2 trails Opus 4.8 by 1 to 13 percent on coding benchmarks like SWE-bench Pro and Terminal-Bench. But it beats Opus 4.8 on math (99.2 vs 95.7 on AIME 2026) and costs about 10x less per token. It is the closest open-weight competitor to Opus 4.8.
What is the GLM-5.2 API pricing?
Input costs $1.40 per million tokens, cached input $0.26, and output $4.40. That is roughly 5 to 10x cheaper than Claude Opus 4.8 or GPT-5.5. Available through Z.ai directly, OpenRouter, GMI Cloud, and Cloudflare Workers AI.
Can I use GLM-5.2 with coding tools like Claude Code or Cursor?
GLM-5.2 works with Claude Code, Cline, OpenCode, Roo Code, Goose, and several other coding agents via OpenAI-compatible API endpoints. You point the tool at your vLLM server or an API provider like OpenRouter.
Does GLM-5.2 support images or multimodal input?
No. GLM-5.2 is text-only at launch. Earlier models like GLM-5V-Turbo support vision, but the 5.2 release focuses on text and code tasks only.
What are the smaller GLM models I can run on a regular GPU?
GLM-4-32B-0414 (Apache 2.0) and GLM-Z1-32B-0414 (MIT) both run on consumer GPUs with 24 GB VRAM. The 9B variants (GLM-4-9B, GLM-Z1-9B) work on even smaller hardware. These are solid mid-range options if the full 744B model is out of reach.
More in Models
View all
Kimi K2.7 Code (2026): 1T MoE Coding Model, Benchmarks & Pricing
Kimi K2.7 Code: 1T open-source coding model from Moonshot AI, 32B active MoE, preserve_thinking mode, benchmarks vs GPT-5.5 and Claude Opus.
Models

MiniMax M3 Open Source (2026): 428B Model, 1M Context & Benchmarks
MiniMax M3: 428B open-weights model, 1M context via sparse attention, native multimodal input, competitive coding benchmarks, and 10x cheaper than GPT-5.5.
Models

US Government Blocks Anthropic Fable 5 & Mythos 5 (2026)
US government ban on Anthropic: Commerce Dept ordered suspension of Fable 5 & Mythos 5 on June 12, 2026. Full timeline of the 4-month feud.
Models
More stories
View all
Siri AI Review (2026): Apple's Rebuilt Assistant vs ChatGPT & Gemini [Tested]
Siri AI is Apple's rebuilt assistant for 2026. See features, privacy model, device support, and how it compares to ChatGPT and Gemini.
Review

Claude Fable 5 Release (2026): Anthropic's Most Powerful AI Model Explained
Claude Fable 5 is the first Mythos-class model available to the public. State-of-the-art coding, vision, and knowledge work with new safeguards. Pricing, benchmarks, and what it means.
Models

Ideogram AI Review (2026): Free Tier Tested, vs Midjourney & Recraft
Ideogram AI review (2026): we tested free tier, pricing, text rendering, and Ideogram 4.0 vs Midjourney and Recraft. Who should use it?
Review

Genspark Speakly Review (2026): Pricing, Accuracy & Is It Worth It?
Honest genspark speakly review after hands-on testing. See Speakly pricing, accuracy, free tier limits, and how it compares to Otter and Whisper.
Review