Gemma 4 12B: Download, Ollama, GGUF & QAT Setup Guide
Download and run Gemma 4 12B locally with Ollama, GGUF (Q4_K_M), and QAT quantization. Includes VRAM requirements, hardware table, and setup guide.
Gemma 4 12B Unified landed on June 3, 2026. One weights file handles text, images, and audio without bolting on separate vision and audio encoders like the smaller E2B/E4B builds. With 11.95B parameters, an Apache 2.0 license, and a Q4 Ollama pull of around 7.6 GB, it is the model people actually try on a 16 GB laptop. On June 5, Google shipped Quantization-Aware Training (QAT) checkpoints for the entire Gemma 4 family; the 12B Q4_0 GGUF now weighs about 6.7 GB and holds quality better than standard post-training quantization (PTQ). Setup paths below include the new QAT repos.
This page is install notes: what changed from Gemma 3, how folks wire it into OpenCode, Codex CLI, and Crush, and where it still loses to cloud agents like Claude Code. For the wider 2026 model map, see Latest AI Models Compared (2026). Install paths start in Gemma 4 12B local setup.
Last updated: June 8, 2026. Live on aitoolsradar.org.
Quick specs
| Spec | Gemma 4 12B Unified (google/gemma-4-12B-it) |
|---|---|
| Parameters | 11.95B (dense, encoder-free) |
| Context | 256K tokens (model card); many runtimes cap lower |
| Modalities | Text, image, audio; video via frame sequences |
| License | Apache 2.0 (Gemma 4 license) |
| Release | 2026-06-03 (12B); Gemma 4 family from 2026-03-31 |
| Ollama disk (Q4_K_M) | ~7.6 GB (gemma4:12b) |
| QAT Q4_0 disk | ~6.7 GB weights (google/gemma-4-12B-it-qat-q4_0-gguf) |
| Best for | Local multimodal agents, OCR-style image Q&A, ASR, privacy-sensitive drafts |
| Watch out for | KV cache VRAM spikes; Ollama 12b without audio; 12B not on OpenRouter yet |

What makes Gemma 4 12B different
Google calls this variant Unified because it removed the standalone vision and audio encoders used on Gemma 4 E2B, E4B, and 31B. Instead:
- Vision: 48×48 pixel patches project through a small embedder (~35M params) into the decoder.
- Audio: 16 kHz waveforms become 40 ms frames, then a linear projection into the same embedding space.
- Text: Standard decoder-only transformer with hybrid attention (sliding window + global layers; final layer always global).
You skip the extra 150M to 550M vision tower and 300M audio encoder files that other Gemma 4 sizes carry. One download, one runtime.
Unified does not mean light on RAM. KV cache, projector weights, and image token budgets (70 to 1120 tokens per image on the card) still eat VRAM.
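To make the numbers above concrete, here is a small arithmetic sketch built only on the figures quoted in this section (16 kHz audio, 40 ms frames, 48×48 patches). It assumes nothing about how frames or patches map to tokens: real preprocessing may resize images to fit the 70 to 1120 token budget, so treat the patch grid as raw geometry, not a token count.

```python
import math

SAMPLE_RATE_HZ = 16_000  # audio sample rate quoted above
FRAME_MS = 40            # audio frame length quoted above
PATCH_PX = 48            # vision patch edge quoted above


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ, frame_ms: int = FRAME_MS) -> int:
    """Raw audio samples covered by one frame."""
    return sample_rate_hz * frame_ms // 1000


def audio_frames(clip_seconds: float, frame_ms: int = FRAME_MS) -> int:
    """Frames needed to cover a clip, rounding the last partial frame up."""
    return math.ceil(clip_seconds * 1000 / frame_ms)


def patch_grid(width_px: int, height_px: int, patch_px: int = PATCH_PX) -> tuple[int, int]:
    """Columns and rows of patches for an image, before any resizing."""
    return math.ceil(width_px / patch_px), math.ceil(height_px / patch_px)


print(samples_per_frame())   # 640
print(audio_frames(30))      # 750
print(patch_grid(960, 480))  # (20, 10)
```

A 30-second clip is 750 frames, which is why long audio adds up quickly even without a separate encoder.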
Benchmarks people cite (instruction-tuned, vendor table)
Google published these instruction-tuned numbers on the model card. We did not rerun the suites. Good for ballpark comparisons, not for picking a winner on your repo.
| Benchmark | Gemma 4 12B | Gemma 3 27B (no think) |
|---|---|---|
| MMLU Pro | 77.2% | 67.6% |
| AIME 2026 (no tools) | 77.5% | 20.8% |
| LiveCodeBench v6 | 72.0% | 29.1% |
| Codeforces ELO | 1659 | 110 |
| GPQA Diamond | 78.8% | 42.4% |
| Tau2 (agent avg) | 69.0% | 16.2% |
| MMMU Pro (vision) | 69.1% | 49.7% |
| MRCR v2 @ 128k | 43.4% | 13.5% |
For coding, stare at LiveCodeBench and Codeforces. For screenshots and PDFs rendered as images, add MMMU Pro. The table compares Gemma 4 12B to Gemma 3 27B, not Gemma 3 12B.
How people use it (launch week, June 2026)
From HN, Reddit, and a few local runs we trust:
1. Local coding and 2026 agent CLIs
- llama.cpp / Ollama + Q4: Developers pull gemma4:12b and run vibe-coding tests. Senko’s minesweeper bench (linked from HN) reported ~5 tok/s on a 12 GB GPU with fixable syntax mistakes, not frontier-cloud quality, but usable offline.
- Terminal agents (June 2026): Builders wire the same local endpoint into OpenCode, Codex CLI (ollama/lmstudio providers), Crush, or Pi via baseURL in config, not legacy IDE-only chat extensions.
- Cloud agents (different lane): Claude Code, Cursor Agent, Google Antigravity, and GitHub Copilot cloud agent are common for paid daily coding in 2026. They do not natively target Ollama; you keep them on Anthropic/OpenAI/Google models and use Gemma 12B for private local passes.
- LiteRT-LM serve: Google documents port 9379 as an OpenAI-compatible shim for local Gemma (Edge blog). Point any agent that accepts a custom baseURL at that server.
- Hybrid stacks: HN commenters often pair Gemma 4 for multimodal with Qwen 3.5 9B for coding on the same 16 GB box, or run Codex/Claude Code in the cloud and OpenCode + Ollama offline on the same repo.
Worth the disk space if you want offline screenshots, audio clips, or drafts you will not send to a cloud API. It will not replace Claude Code or Codex on a nasty refactor. For terminal work, OpenCode or Crush pointed at Ollama is the usual pattern.
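If you want to see what those agents send over the wire, here is a minimal sketch of an OpenAI-style chat request against Ollama's local endpoint, using only the Python standard library. The helper names (build_chat_request, send) and the prompt are made up for illustration; the base URL and model tag are the ones used elsewhere on this page.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to <base_url>/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 120.0) -> str:
    """Send the request and return the assistant text. Needs a running server."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


request = build_chat_request(OLLAMA_BASE_URL, "gemma4:12b", "Explain this stack trace.")
print(request.full_url)  # http://localhost:11434/v1/chat/completions
```

Call send(request) once ollama run gemma4:12b is up. Swapping the base URL is all it should take to target another OpenAI-compatible server.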
2. Multimodal workflows (OCR, PDF, screenshots, video, ASR)
| Workflow | How builders do it |
|---|---|
| Screenshot / UI Q&A | Feed PNGs in chat template; no separate OCR API |
| PDF | Render pages to images, or use Pi liteparse skills (Patrick Loeber) |
| Video | Sample frames (e.g. 1 FPS); Google demoed keynote-length clips |
| ASR / translation | Native audio tokens on 12B; verify language support on your clip |
| Charts from CSV | Google AI Edge Gallery runs Python in a sandbox on Mac |
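For the video row, a rough sketch of 1 FPS frame sampling and what it costs in context. The function names are illustrative, and it assumes each sampled frame is charged like a still image (70 to 1120 tokens, per the model card range quoted earlier); check your runtime's actual per-frame budget.

```python
import math


def sample_frame_indices(total_frames: int, source_fps: float, sample_fps: float = 1.0) -> list[int]:
    """Indices of the source frames to keep when sampling at sample_fps."""
    if total_frames <= 0 or source_fps <= 0 or sample_fps <= 0:
        return []
    # Source frames between kept frames; never step by less than one frame.
    step = max(1.0, source_fps / sample_fps)
    kept = math.ceil(total_frames / step)
    return [min(int(i * step), total_frames - 1) for i in range(kept)]


def frame_token_cost(frame_count: int, tokens_per_image: int) -> int:
    """Context tokens spent if every frame costs tokens_per_image."""
    return frame_count * tokens_per_image


indices = sample_frame_indices(total_frames=18_000, source_fps=30.0)  # 10 min at 30 FPS
print(len(indices))                          # 600
print(frame_token_cost(len(indices), 70))    # 42000
print(frame_token_cost(len(indices), 1120))  # 672000
```

At the top of the image budget, ten minutes of 1 FPS video would not fit in a 256K window, so long clips need the low token setting or sparser sampling.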
3. On-device (Mac and iPhone)
- Google AI Edge Gallery on iOS and macOS for on-device Gemma family models.
- Eloquent for voice edit with 12B on Mac (June 2026 launch week).
- LiteRT-LM for cross-platform CLI import and litert-lm serve.
Hardware still matters: 12B on a phone is not guaranteed on every device.
4. Fine-tuning
- Unsloth Gemma 4 train guide: unified LoRA touches vision, audio, and text in one pass.
- Show HN: gemma-tuner-multimodal for Apple Silicon audio fine-tunes. Authors warn about OOM near ~2k tokens on 64 GB Macs.
Gemma 4 12B vs Gemma 3 12B (and vs E4B)
| Question | Answer |
|---|---|
| gemma 4 12b vs gemma 3 12b | Gemma 4 adds audio at 12B, encoder-free fusion, 256K context, native system role, thinking + tool templates, Apache 2.0. Gemma 3 used separate encoders and older Gemma Terms. |
| gemma 4 e4b vs gemma 3 12b | E4B is the phone-class ~4.5B effective model with encoders; 12B Unified targets laptops with stronger coding and agent scores. |
| vs Qwen 3.5 9B (local coding) | Community leans Qwen for Pi-style coding on 16 GB; Gemma wins multimodal + translation breadth in many HN threads. Test both. |
| vs Gemma 4 31B / 26B A4B | Use larger Gemma 4 if you have RAM; 12B is the 16 GB sweet spot. |
Gemma 4 12B local setup (June 2026)
Pick one path below. Default checkpoint: google/gemma-4-12B-it unless you host the base model yourself.
Path A: Ollama (fastest for most people)
Best when you want gemma 4 12b ollama in one command.
ollama pull gemma4:12b
ollama run gemma4:12b
| Tag | Disk | Context | Modalities (Ollama page) |
|---|---|---|---|
| gemma4:12b | ~7.6 GB | 256K | Text, image |
| gemma4:12b-it-q8_0 | ~13 GB | 256K | Text, image |
| gemma4:12b-mlx | ~10 GB | 128K | Text only (Mac) |
Sampling (vendor readme): temperature 1.0, top_p 0.95, top_k 64.
Thinking in Ollama: put <|think|> at the start of the system prompt.
Vision prompt order: image before text.
Audio on 12B: not on Ollama’s 12b tag as of June 5. Use Path B or D for audio, or gemma4:e4b in Ollama for smaller audio-capable builds.
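The settings above can be bundled into one request body for Ollama's native /api/chat endpoint. This is a sketch, not an official client: the helper name, system prompt text, and default num_ctx are placeholders, while the sampling values and the <|think|> prefix come from the notes above.

```python
import base64


def build_ollama_chat_payload(prompt, image_bytes=None, thinking=False, num_ctx=8192):
    """Request body for POST http://localhost:11434/api/chat."""
    system = "You are a concise assistant."  # placeholder system prompt
    if thinking:
        # Thinking is switched on by prefixing the system prompt.
        system = "<|think|>" + system
    user = {"role": "user", "content": prompt}
    if image_bytes is not None:
        # The native API takes images as base64 strings on the message.
        user["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    return {
        "model": "gemma4:12b",
        "stream": False,
        "messages": [{"role": "system", "content": system}, user],
        # Vendor-recommended sampling, plus an explicit context cap.
        "options": {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "num_ctx": num_ctx},
    }


payload = build_ollama_chat_payload("What does this dialog say?", image_bytes=b"\x89PNG", thinking=True)
print(payload["messages"][0]["content"][:9])  # <|think|>
```

Setting num_ctx explicitly keeps the KV cache predictable instead of relying on whatever default your Ollama build picks.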
Path B: llama.cpp + Unsloth GGUF (control + vision)
Best when you care about gemma 4 12b unsloth quant quality. Unsloth now ships QAT GGUFs alongside the original quants.
- Download from unsloth/gemma-4-12b-it-GGUF or the QAT variants at unsloth/gemma-4-12B-it-qat-GGUF.
- Grab mmproj-BF16.gguf (~175 MB) for vision.
- Recommended quant: UD-Q4_K_XL (~7.37 GB) per Unsloth KL benchmarks.
./llama-server \
--model gemma-4-12b-it-UD-Q4_K_XL.gguf \
--mmproj mmproj-BF16.gguf \
--temp 1.0 --top-p 0.95 --top-k 64 \
--port 8001 \
--chat-template-kwargs '{"enable_thinking":false}'

GGUF size reference (12B IT, weights only):
| Quant | ~Size |
|---|---|
| UD-IQ3_XXS | 4.6 GB |
| Q4_K_M | 7.1 GB |
| UD-Q4_K_XL | 7.4 GB |
| Q5_K_M | 8.4 GB |
| Q8_0 | 12.7 GB |
| BF16 | 23.8 GB |
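One way to read that table is as average bits per weight. The sketch below divides each file size by the 11.95B parameter count from the spec table. It assumes the sizes are decimal gigabytes, and the result is only an average, since quantized files keep some tensors at higher precision.

```python
PARAMS = 11.95e9  # parameter count from the spec table

# Sizes copied from the table above, assumed to be decimal GB.
GGUF_SIZES_GB = {
    "UD-IQ3_XXS": 4.6,
    "Q4_K_M": 7.1,
    "UD-Q4_K_XL": 7.4,
    "Q5_K_M": 8.4,
    "Q8_0": 12.7,
    "BF16": 23.8,
}


def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Average bits stored per parameter for a file of size_gb."""
    return size_gb * 1e9 * 8 / params


for name, size_gb in GGUF_SIZES_GB.items():
    print(f"{name:<11} {bits_per_weight(size_gb):5.2f} bits/weight")
```

BF16 comes out just under 16 bits, which is a quick sanity check that the table and the parameter count agree.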
Path C: MLX on Mac (gemma 4 12b on mac)
| Asset | ~Size | Notes |
|---|---|---|
| mlx-community/gemma-4-12B-it-4bit | ~11 GB | Vision-capable MLX |
| ollama run gemma4:12b-mlx | ~10 GB | Text-only, 128K |
Unsloth ships an install script for MLX chat:
curl -fsSL https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/scripts/install_gemma4_mlx.sh | sh
source ~/.unsloth/unsloth_gemma4_mlx/bin/activate
python -m mlx_vlm.chat --model mlx-community/gemma-4-12B-it-4bit

M2/M3 16 GB: workable for Q4-class 12B with modest context. M2 8 GB: prefer gemma4:e4b or heavy quants.
Path D: Transformers (full multimodal + audio)
Best when you need the official template for audio URLs, video frames, and enable_thinking.
pip install -U "transformers>=5.10.1" torch accelerate torchvision librosa

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-12B-it"

# Load the processor (tokenizer + chat template) and the model weights.
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID, dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this repo layout in five bullets."},
]

# Render the chat template straight to tensors; thinking stays off here.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(outputs[0][prompt_len:], skip_special_tokens=True))

For images, put {"type": "image", "url": "..."} before text. For audio, put {"type": "audio", "audio": "..."} after instruction text (per model card).
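The ordering rule is easy to get backwards, so here is a small sketch of two message builders that encode it. The helper names and example.com URLs are placeholders; the image and audio part shapes follow the note above, and the text part uses the usual {"type": "text", "text": ...} form.

```python
def image_question(image_url: str, question: str) -> list[dict]:
    """User turn with the image part placed before the text part."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": question},
        ],
    }]


def audio_instruction(instruction: str, audio_url: str) -> list[dict]:
    """User turn with the instruction text placed before the audio part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            {"type": "audio", "audio": audio_url},
        ],
    }]


# Placeholder URLs; swap in your own files.
vision_messages = image_question("https://example.com/screenshot.png", "What error is shown?")
audio_messages = audio_instruction("Transcribe this clip.", "https://example.com/clip.wav")
print([part["type"] for part in vision_messages[0]["content"]])  # ['image', 'text']
print([part["type"] for part in audio_messages[0]["content"]])   # ['text', 'audio']
```

Either list can be passed as messages to processor.apply_chat_template in place of the text-only example.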
Path E: LiteRT-LM (gemma 4 12b litertlm)
Google’s path for OpenAI-compatible local API and Mac agent workflows:
litert-lm import --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm \
gemma-4-12B-it.litertlm gemma4-12b
litert-lm serve

Wire 2026 agents to http://localhost:9379/v1 (verify port in your build):
| Agent | Config hook |
|---|---|
| OpenCode | baseURL → LiteRT-LM or http://localhost:11434/v1 for Ollama |
| Codex CLI | [model_providers.local] with ollama or custom base_url |
| Crush | OpenAI-compatible base_url in crush.json |
| Pi | ~/.pi/agent/models.json → http://localhost:11434/v1 |
Not a native local target: Claude Code, Cursor (BYOK cloud APIs only), Antigravity and Antigravity CLI (cloud models; replaces deprecated Gemini CLI).
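Before editing an agent config, it can help to check which local server is actually answering. The sketch below probes the base URLs mentioned on this page with a GET on /models. That route is standard on Ollama and llama-server; treating it as available on LiteRT-LM is an assumption, and the function names are illustrative.

```python
import json
import urllib.request

# Base URLs as used on this page; adjust ports to your setup.
LOCAL_ENDPOINTS = {
    "ollama": "http://localhost:11434/v1",
    "litert-lm": "http://localhost:9379/v1",
    "llama-server": "http://localhost:8001/v1",
}


def models_url(base_url: str) -> str:
    """URL of the model-listing route under an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def first_reachable(endpoints: dict = LOCAL_ENDPOINTS, timeout_s: float = 1.0):
    """Name of the first endpoint that returns JSON from /models, else None."""
    for name, base_url in endpoints.items():
        try:
            with urllib.request.urlopen(models_url(base_url), timeout=timeout_s) as response:
                json.load(response)
            return name
        except (OSError, ValueError):
            # Connection refused, timeout, HTTP error, or a non-JSON reply.
            continue
    return None


if __name__ == "__main__":
    print(first_reachable())
```

Whatever name comes back, the matching base URL is the value to paste into the agent's baseURL or base_url field.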
Path F: vLLM (production GPU)
Use the vLLM Gemma 4 12B recipe with nightly / gemma4-unified images. Plan 40 GB+ VRAM for comfortable BF16 serving; for lower VRAM, use the QAT compressed-tensors checkpoint at google/gemma-4-12B-it-qat-w4a16-ct (152k+ downloads). SGLang also supports the QAT checkpoints for efficient serving.
vllm serve google/gemma-4-12B-it \
--max-model-len 16384 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--limit-mm-per-prompt '{"image": 4, "audio": 1}'
VRAM and RAM planning
Google’s weight-only table (verify against your own quant):
| Precision | Weight memory |
|---|---|
| Q4_0 (QAT) | ~6.7 GB |
| 8-bit | ~13.4 GB |
| BF16 | ~26.7 GB |
Real-world rule: add 2–8+ GB for KV cache depending on context. Per Ollama community reports, Gemma models with large vocabularies can use far more VRAM at 32K context than the weight size alone suggests.
| Hardware | Suggested quant | Realistic context |
|---|---|---|
| 8 GB GPU | Q3 / UD-IQ3_XXS | 4K–8K |
| 12 GB GPU | Q4_K_M | 8K–16K |
| 16 GB GPU / RAM | Ollama gemma4:12b | 16K–32K |
| 24 GB GPU | Q8_0 or Q5_K_M | 32K–64K |
| 40 GB+ | vLLM BF16 | 32K+ (raise carefully) |
KV cache tip (Ollama): try OLLAMA_KV_CACHE_TYPE=q8_0 if context spikes VRAM.
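To see why context length and KV cache type matter, here is a back-of-envelope estimator. The layer count, KV head count, and head size below are HYPOTHETICAL placeholders, not Gemma 4 12B's published config; read the real values from the model's config file. The formula also treats every layer as global attention, so with sliding-window layers it is an upper bound.

```python
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # ggml q8_0: 32 int8 values plus one fp16 scale per block


def kv_cache_gb(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_element=F16_BYTES):
    """Decimal GB for K and V caches, assuming full attention in every layer."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens  # 2 = K and V
    return elements * bytes_per_element / 1e9


# HYPOTHETICAL shape for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128

for ctx in (8_192, 32_768):
    f16 = kv_cache_gb(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, F16_BYTES)
    q8 = kv_cache_gb(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, Q8_0_BYTES)
    print(f"{ctx:>6} tokens: f16 {f16:.2f} GB, q8_0 {q8:.2f} GB")
```

With these placeholder dimensions the f16 cache is about 1.6 GB at 8K and 6.4 GB at 32K, and q8_0 roughly halves both.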
Gemma 4 QAT (Quantization-Aware Training) — June 5, 2026
Two days after the 12B launch, Google released QAT checkpoints for every Gemma 4 size. Instead of quantizing after training (PTQ), QAT simulates quantization during training itself. The result: compressed weights that hold more of the original model quality than standard PTQ quants.
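To give a feel for what "simulates quantization during training" means, here is a toy fake-quantization step in plain Python. It is a generic symmetric 4-bit scheme with one scale per list, not Google's QAT recipe and not the exact Q4_0 block format. In real QAT the forward pass runs on weights rounded like this while the optimizer keeps updating the full-precision copy; this sketch shows only the rounding.

```python
def fake_quantize(weights: list[float], bits: int = 4) -> list[float]:
    """Round weights to a symmetric signed integer grid, then map back to floats."""
    if not weights:
        return []
    q_max = 2 ** (bits - 1) - 1   # 7 for 4-bit
    q_min = -(2 ** (bits - 1))    # -8 for 4-bit
    peak = max(abs(w) for w in weights)
    if peak == 0.0:
        return list(weights)
    scale = peak / q_max          # one shared step size for the whole list
    return [max(q_min, min(q_max, round(w / scale))) * scale for w in weights]


weights = [0.7, -0.35, 0.1, 0.0]
quantized = fake_quantize(weights)
errors = [abs(w - q) for w, q in zip(weights, quantized)]
print(quantized)
print(max(errors))
```

Training against the rounded values lets the model adapt to that error ahead of time, which is the intuition behind QAT holding quality better than quantizing afterwards.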
For 12B, you now have three new Hugging Face repos:
| Repo | Format | ~Downloads (Jun 5) | Use for |
|---|---|---|---|
google/gemma-4-12B-it-qat-q4_0-gguf | GGUF Q4_0 | 52k+ | llama.cpp, Ollama import |
google/gemma-4-12B-it-qat-w4a16-ct | Compressed tensors | 152k+ | vLLM, SGLang |
google/gemma-4-12B-it-qat-q4_0-unquantized | Unquantized BF16 | 4.5k+ | Custom conversion to other formats |
Unsloth also ships QAT GGUFs at unsloth/gemma-4-12B-it-qat-GGUF (121k+ downloads) with their UD quants. And Google published MTP QAT checkpoints so you keep the multi-token prediction speedup even with quantized weights.

Which quant to pick now: the QAT Q4_0 at ~6.7 GB is roughly 0.4 GB smaller than the standard Q4_K_M GGUF (~7.1 GB), and Google’s benchmarks show it holds quality closer to the original BF16 model. If you already pulled gemma4:12b in Ollama and it works, no need to switch immediately. But for fresh installs or vLLM serving, start with the QAT checkpoints. You can also run QAT models directly in the browser via Transformers.js.
Thinking mode, tools, and system prompts
Gemma 4 adds a native system role and structured function calling tokens. For agents:
- Declare tools in apply_chat_template(..., tools=[...]).
- Enable thinking when plans are hard: enable_thinking=True.
- On the next user turn, drop thought channels from history, except between tool calls in one agent turn (thinking docs).
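A sketch of the first and third bullets: a tool declared as a JSON schema dict, and a history pruner that drops thoughts from earlier turns but keeps them inside the current tool loop. The get_weather tool is made up, and storing thoughts under a "thinking" key is an assumption about your client, not something the chat template mandates.

```python
def get_weather_tool() -> dict:
    """Hypothetical tool schema, suitable for tools=[...] in apply_chat_template."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string", "description": "City name"}},
                "required": ["city"],
            },
        },
    }


def prune_thoughts(history: list[dict]) -> list[dict]:
    """Drop 'thinking' from assistant turns that precede the latest user message."""
    last_user = max((i for i, m in enumerate(history) if m.get("role") == "user"), default=-1)
    pruned = []
    for i, message in enumerate(history):
        if i < last_user and message.get("role") == "assistant":
            # Copy without the thought; the original dict is left untouched.
            message = {k: v for k, v in message.items() if k != "thinking"}
        pruned.append(message)
    return pruned


history = [
    {"role": "user", "content": "Weather in Oslo?"},
    {"role": "assistant", "content": "Cold.", "thinking": "old plan"},
    {"role": "user", "content": "And Rome?"},
    {"role": "assistant", "content": "", "thinking": "need the tool"},
]
print(["thinking" in m for m in prune_thoughts(history)])  # [False, False, False, True]
```

The last assistant message keeps its thought because it comes after the latest user message, which is the "between tool calls in one agent turn" case.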
Multi-token prediction (MTP): optional drafter checkpoints for faster inference (MTP overview). Supported stacks include Ollama, MLX, vLLM, and LiteRT-LM. Google also released MTP QAT checkpoints, so you can combine faster decoding with the quality-preserving QAT quantization.
Coding agents in 2026 (what to pair with Gemma 4 12B)
If you are picking agents in mid-2026, split the problem in two: terminal CLIs that can hit a local OpenAI URL, and cloud IDE agents that want Anthropic, OpenAI, or Google keys. Gemma 12B only fits the first bucket.
What developers actually run (June 2026)

| Agent | Type | Local Gemma 4 12B? | Typical role |
|---|---|---|---|
| Claude Code | Terminal + IDE + desktop | No (cloud Anthropic; gateway only) | Daily agent work, MCP, subagents |
| Codex CLI | Terminal (OpenAI) | Yes: native ollama / lmstudio providers | codex exec, worktrees, automation |
| OpenCode | Open terminal + desktop | Yes: Ollama + any OpenAI-compatible URL | Free/open multi-provider agent |
| Crush | Terminal (Charm) | Yes: base_url in config | TUI coding agent, MCP, LSP-aware edits |
| Cursor | AI IDE + CLI | No (BYOK cloud keys only) | IDE Agent, Cloud Agent handoff |
| Antigravity + CLI | Google IDE + terminal | No (cloud models) | Consumer Gemini CLI moves to Antigravity CLI (Google blog); enterprise Gemini CLI continues |
| GitHub Copilot cloud agent | Cloud PR agent | No | Repo-wide tasks on GitHub infra |
| Pi | Minimal harness | Yes: models.json | Power users, extensions, local control |
| OpenClaw | Messaging orchestrator | Via wired backend | Delegates to Codex/Cursor/Claude from chat |
Google’s launch blog still name-drops older OpenAI-shim tools. In June 2026 threads the local wiring is usually OpenCode, Codex CLI, or Crush.
Recommended stack: local Gemma + cloud frontier
- Run weights: ollama pull gemma4:12b or litert-lm serve.
- Local agent: ollama launch opencode or Pi/Codex pointed at http://localhost:11434/v1.
- Hard tasks: same repo in Claude Code (Opus) or Codex (GPT-5.5 class) on cloud.
- Google UI lane: Antigravity or Antigravity CLI. Google is retiring the consumer Gemini CLI in favor of Antigravity CLI; enterprise Gemini CLI continues. Local Gemma stays on Ollama/OpenCode.
OpenCode + Ollama (copy-paste pattern)
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": { "baseURL": "http://localhost:11434/v1" },
      "models": {
        "gemma4:12b": { "name": "Gemma 4 12B (local)" }
      }
    }
  }
}

Docs: Ollama + OpenCode. Start around 8k to 16k context on a 16 GB box. Raise num_ctx until VRAM complains; 64k is wishful on most laptops.
Codex CLI + Ollama (sketch)
In ~/.codex/config.toml (verify against the Codex advanced config docs):
[model_providers.local_ollama]
base_url = "http://localhost:11434/v1"

Then select gemma4:12b as the model for a sandboxed task. Codex is the OpenAI agent CLI. Useful when you already live in codex exec but want offline weights.
Crush + local OpenAI-compatible API
Point an OpenAI-compatible provider at Ollama or LiteRT-LM in crush.json (see Crush README configuration section). Same pattern as Open WebUI + local server setups.
Coding fit table (Gemma 4 12B itself)
| Use case | Fit |
|---|---|
| Offline refactors via OpenCode/Crush | Good with Q4 + 8K–16K context |
| Same repo as Claude Code/Codex cloud | Hybrid: local for private files, cloud for ship |
| Tool-calling agents | Strong vendor Tau2 scores; test JSON schema in OpenCode |
| Repo-wide 128K reasoning | Possible in theory; watch VRAM on 16 GB |
| Antigravity / Cursor default | Use cloud models; Gemma is a parallel local lane |
API and cloud routes (gemma 4 12b api)
| Route | 12B status (June 5, 2026) |
|---|---|
| Hugging Face Inference / Endpoints | google/gemma-4-12B-it |
| OpenRouter google/gemma-4-12b-it | Not listed; use google/gemma-4-31b-it if you need hosted Gemma 4 today |
| Google AI Studio / Gemini API | Family docs emphasize 26B A4B and 31B. Check Studio for 12B availability |
| Vertex Model Garden | Production Gemma 4 family |
| Local OpenAI shim | litert-lm serve, LM Studio server, llama-server |
See our OpenRouter Free Models (2026) guide for router patterns; swap model IDs when 12B appears.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| CUDA OOM at modest context | KV + vision tokens | Lower num_ctx; Q4 quant; fewer image tokens |
| VRAM much larger than 7.6 GB | Embedding tables + f16 KV | Cap context; OLLAMA_KV_CACHE_TYPE=q8_0 |
| Vision fails in llama.cpp | Missing mmproj | Add mmproj-BF16.gguf |
| AutoModel class errors | Old Transformers | pip install -U "transformers>=5.10.1" |
| No audio in Ollama 12b | Tag limitation | Transformers, LiteRT-LM, or vLLM |
| Empty thought tags with thinking off | 12B template quirk | enable_thinking=False or strip in post-process |
| vLLM model not found | Needs unified nightly | Use gemma4-unified container per recipe |
Who should use, watch, or skip
| Audience | Verdict |
|---|---|
| Privacy-first dev with 16 GB GPU | Use: Ollama or UD-Q4_K_XL + mmproj |
| Multimodal agent on Mac | Use: LiteRT-LM or MLX 4-bit |
| Production API at scale | Watch: prefer 31B hosted or vLLM on 40 GB+ until 12B API listings stabilize |
| Best local coder only | Watch: benchmark against Qwen 3.5 9B on your prompts |
| Need guaranteed commercial audio rights for ads | Skip for client work until legal reviews Terms |
Related reading on AI Tools Radar
- Latest AI Models Compared (2026). Where Gemma 4 sits next to GPT-5.5 and DeepSeek V4
- DeepSeek V4 vs ChatGPT vs Claude for Coding (2026). If coding is your main stack
- Devin Desktop vs Cursor (2026). Picking an IDE agent lane beside local Gemma
- OpenRouter Free Models (2026). Hosted Gemma 4 31B patterns until 12B routes ship
Changelog
- 2026-06-05: First publish. Local setup for Ollama, MLX, LiteRT-LM, and 2026 coding agents (OpenCode, Codex CLI, Crush). Consumer Gemini CLI moves to Antigravity CLI.
- 2026-06-08 (update): Added Gemma 4 QAT (Quantization-Aware Training) checkpoints — Q4_0 GGUF, vLLM compressed tensors, and Unsloth QAT GGUFs for 12B. MTP QAT checkpoints also available.
Frequently asked
What is Gemma 4 12B?
Gemma 4 12B Unified is Google DeepMind's encoder-free open model with about 11.95 billion parameters. It handles text, images, and audio in one decoder-only transformer, supports up to 256K context on the model card, and ships under Apache 2.0. Use the instruction-tuned checkpoint google/gemma-4-12B-it for chat and agents.
How much VRAM do I need for Gemma 4 12B locally?
A Q4_K_M GGUF or Ollama gemma4:12b build needs roughly 7.6 GB for weights plus a vision projector and extra memory for KV cache. Plan 16 GB unified memory or VRAM for comfortable 8K–16K context. Eight-gigabyte GPUs can run aggressive Q3 quants with short context only.
Does Ollama support Gemma 4 12B?
Yes. Run ollama pull gemma4:12b for a 7.6 GB Q4_K_M build with text and image input and 256K advertised context. Ollama's 12b tag does not list audio today; use Transformers, LiteRT-LM, or vLLM nightly for native 12B audio on your machine.
Is Gemma 4 12B good for coding?
It scores well on vendor coding benchmarks such as LiveCodeBench v6 and Codeforces ELO versus Gemma 3. Early community tests show workable local coding at about 5 tokens per second on a 12 GB GPU, but many devs still prefer Qwen 3.5 9B or larger Gemma 4 sizes for agent loops. Run your own repo prompts before you switch stacks.
How do I enable thinking mode on Gemma 4 12B?
Add the think token at the start of the system prompt or pass enable_thinking=True in Transformers apply_chat_template. The model emits a thought channel before the final answer. Strip thought blocks from chat history on the next turn except during multi-step tool calls in one agent turn.
Gemma 4 12B vs Gemma 3 12B: which should I download?
Pick Gemma 4 12B if you want native audio at this size, encoder-free multimodal fusion, Apache 2.0 licensing, 256K context, and stronger math and coding scores on Google's published tables. Stay on Gemma 3 12B only if you already tuned a pipeline and do not need audio or the new chat template yet.
Can I use Gemma 4 12B on a Mac?
Yes. Ollama offers gemma4:12b-mlx for text at about 10 GB on Apple Silicon, or use mlx-community/gemma-4-12B-it-4bit for vision-capable MLX at about 11 GB. Google's AI Edge Gallery and LiteRT-LM also target Mac laptops with OpenAI-compatible local serve.
Is Gemma 4 12B on OpenRouter?
As of June 5, 2026, OpenRouter lists google/gemma-4-31b-it and free variants, not the 12B slug. For hosted API access to the 12B class, check Google AI Studio or Vertex Model Garden. For local agents, run Ollama or litert-lm serve and point OpenCode, Codex CLI, Crush, or Pi at the OpenAI-compatible base URL.
Which AI coding agents work with local Gemma 4 12B in 2026?
OpenCode, Pi, Codex CLI, and Crush support custom base URLs or native Ollama providers. Claude Code, Cursor, Google Antigravity, and GitHub Copilot cloud agent expect cloud models for full agent features. Consumer Gemini CLI is moving to Antigravity CLI; enterprise Gemini CLI continues. Pair Gemma locally with OpenCode or Crush for terminal agents. Keep Claude Code on Anthropic and Codex on OpenAI for hard production repos.
What is Gemma 4 QAT and where do I get the 12B checkpoints?
QAT (Quantization-Aware Training) checkpoints released June 5, 2026 simulate quantization during training, so compressed weights hold more quality than standard post-training quants. For 12B, grab the Q4_0 GGUF at google/gemma-4-12B-it-qat-q4_0-gguf (~6.7 GB), compressed tensors for vLLM at google/gemma-4-12B-it-qat-w4a16-ct, or Unsloth QAT GGUFs at unsloth/gemma-4-12B-it-qat-GGUF. MTP QAT checkpoints are also available.