Gemma 4 12B: Download, Ollama, GGUF & QAT Setup Guide

Download and run Gemma 4 12B locally with Ollama, GGUF (Q4_K_M), and QAT quantization. Includes VRAM requirements, hardware table, and setup guide.

AI Tools Radar Editorial June 5, 2026 Updated June 8, 2026 16 min read

Gemma 4 12B Unified landed on June 3, 2026. One weights file handles text, images, and audio without bolting on separate vision and audio encoders like the smaller E2B/E4B builds. With 11.95B parameters, an Apache 2.0 license, and a Q4 Ollama pull of around 7.6 GB, it is the model people actually try on a 16 GB laptop. On June 5, Google shipped Quantization-Aware Training (QAT) checkpoints for the entire Gemma 4 family; the 12B Q4_0 GGUF now weighs about 6.7 GB and holds quality better than standard post-training quantization (PTQ). The setup paths below include the new QAT repos.

This page is install notes: what changed from Gemma 3, how folks wire it into OpenCode, Codex CLI, and Crush, and where it still loses to cloud agents like Claude Code. For the wider 2026 model map, see Latest AI Models Compared (2026). Install paths start in Gemma 4 12B local setup.

Last updated: June 8, 2026. Live on aitoolsradar.org.

Quick specs

Spec	Gemma 4 12B Unified (`google/gemma-4-12B-it`)
Parameters	11.95B (dense, encoder-free)
Context	256K tokens (model card); many runtimes cap lower
Modalities	Text, image, audio; video via frame sequences
License	Apache 2.0 (Gemma 4 license)
Release	2026-06-03 (12B); Gemma 4 family from 2026-03-31
Ollama disk (Q4_K_M)	~7.6 GB (`gemma4:12b`)
QAT Q4_0 disk	~6.7 GB weights (`google/gemma-4-12B-it-qat-q4_0-gguf`)
Best for	Local multimodal agents, OCR-style image Q&A, ASR, privacy-sensitive drafts
Watch out for	KV cache VRAM spikes; Ollama 12b without audio; 12B not on OpenRouter yet

Hugging Face card for `google/gemma-4-12B-it`. Screenshot from huggingface.co, captured June 5, 2026. Tags and download counts change daily.

What makes Gemma 4 12B different

Google calls this variant Unified because it removed the standalone vision and audio encoders used on Gemma 4 E2B, E4B, and 31B. Instead:

Vision: 48×48 pixel patches project through a small embedder (~35M params) into the decoder.
Audio: 16 kHz waveforms become 40 ms frames, then a linear projection into the same embedding space.
Text: Standard decoder-only transformer with hybrid attention (sliding window + global layers; final layer always global).
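The patch and frame sizes above translate into simple arithmetic. The sketch below counts how many 48×48 patches tile an image and how many 40 ms frames cover a 16 kHz clip. It assumes non-overlapping patches and frames and ignores any resizing or padding the real processor applies, so treat the numbers as rough planning figures, not the model's actual token counts. The function names are ours.

```python
PATCH_PX = 48            # from the article: 48x48 pixel patches
SAMPLE_RATE_HZ = 16_000  # from the article: 16 kHz waveforms
FRAME_MS = 40            # from the article: 40 ms frames


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def patch_grid(width_px: int, height_px: int) -> tuple[int, int]:
    """Columns and rows of patches needed to tile an image (assumes no resizing)."""
    return ceil_div(width_px, PATCH_PX), ceil_div(height_px, PATCH_PX)


def samples_per_frame() -> int:
    """Raw audio samples that fall into one 40 ms frame at 16 kHz."""
    return SAMPLE_RATE_HZ * FRAME_MS // 1000


def frames_for_clip(seconds: float) -> int:
    """Frames needed to cover a clip, counting a trailing partial frame."""
    total_samples = int(round(seconds * SAMPLE_RATE_HZ))
    return ceil_div(total_samples, samples_per_frame())
```

Under these assumptions a 960×480 image tiles into a 20×10 grid, each audio frame holds 640 samples, and a 30-second clip produces 750 frames.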

You skip the extra 150M to 550M vision tower and 300M audio encoder files that other Gemma 4 sizes carry. One download, one runtime.

Unified does not mean light on RAM. KV cache, projector weights, and image token budgets (70 to 1120 tokens per image on the card) still eat VRAM.

Benchmarks people cite (instruction-tuned, vendor table)

Google published these instruction-tuned numbers on the model card. We did not rerun the suites. Good for ballpark comparisons, not for picking a winner on your repo.

Benchmark	Gemma 4 12B	Gemma 3 27B (no think)
MMLU Pro	77.2%	67.6%
AIME 2026 (no tools)	77.5%	20.8%
LiveCodeBench v6	72.0%	29.1%
Codeforces ELO	1659	110
GPQA Diamond	78.8%	42.4%
Tau2 (agent avg)	69.0%	16.2%
MMMU Pro (vision)	69.1%	49.7%
MRCR v2 @ 128k	43.4%	13.5%

For coding, stare at LiveCodeBench and Codeforces. For screenshots and PDFs rendered as images, add MMMU Pro. The table compares Gemma 4 12B to Gemma 3 27B, not Gemma 3 12B.

How people use it (launch week, June 2026)

From HN, Reddit, and a few local runs we trust:

1. Local coding and 2026 agent CLIs

llama.cpp / Ollama + Q4: Developers pull gemma4:12b and run vibe-coding tests. Senko’s minesweeper bench (linked from HN) reported ~5 tok/s on a 12 GB GPU, with syntax mistakes that were fixable. That is not frontier-cloud quality, but it is usable offline.
Terminal agents (June 2026): Builders wire the same local endpoint into OpenCode, Codex CLI (ollama / lmstudio providers), Crush, or Pi via baseURL in config, not legacy IDE-only chat extensions.
Cloud agents (different lane): Claude Code, Cursor Agent, Google Antigravity, and GitHub Copilot cloud agent are common for paid daily coding in 2026. They do not natively target Ollama; you keep them on Anthropic/OpenAI/Google models and use Gemma 12B for private local passes.
LiteRT-LM serve: Google documents port 9379 as an OpenAI-compatible shim for local Gemma (Edge blog). Point any agent that accepts a custom baseURL at that server.
Hybrid stacks: HN commenters often pair Gemma 4 for multimodal with Qwen 3.5 9B for coding on the same 16 GB box, or run Codex/Claude Code in the cloud and OpenCode + Ollama offline on the same repo.

Worth the disk space if you want offline screenshots, audio clips, or drafts you will not send to a cloud API. It will not replace Claude Code or Codex on a nasty refactor. For terminal work, OpenCode or Crush pointed at Ollama is the usual pattern.

2. Multimodal workflows (OCR, PDF, screenshots, video, ASR)

Workflow	How builders do it
Screenshot / UI Q&A	Feed PNGs in chat template; no separate OCR API
PDF	Render pages to images, or use Pi liteparse skills (Patrick Loeber)
Video	Sample frames (e.g. 1 FPS); Google demoed keynote-length clips
ASR / translation	Native audio tokens on 12B; verify language support on your clip
Charts from CSV	Google AI Edge Gallery runs Python in a sandbox on Mac
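For the video row, it helps to budget tokens before sampling frames. This is a rough sketch using the 70 to 1120 tokens-per-image range quoted from the model card and the 1 FPS example above. It assumes "256K" means 262,144 tokens and that every frame costs the same; the 4,096-token reserve for prompt and answer is our placeholder.

```python
MIN_TOKENS_PER_IMAGE = 70    # lower bound quoted from the model card
MAX_TOKENS_PER_IMAGE = 1120  # upper bound quoted from the model card
CONTEXT_TOKENS = 256 * 1024  # assumption: "256K" means 262,144 tokens


def frame_count(clip_seconds: float, fps: float = 1.0) -> int:
    """Frames sampled from a clip at a fixed rate."""
    return int(clip_seconds * fps)


def video_token_range(clip_seconds: float, fps: float = 1.0) -> tuple[int, int]:
    """Cheapest and most expensive token cost for the sampled frames."""
    frames = frame_count(clip_seconds, fps)
    return frames * MIN_TOKENS_PER_IMAGE, frames * MAX_TOKENS_PER_IMAGE


def max_clip_seconds(tokens_per_frame: int, fps: float = 1.0,
                     reserve_tokens: int = 4096) -> float:
    """Longest clip whose frames fit the context after reserving room for text."""
    usable = CONTEXT_TOKENS - reserve_tokens
    return (usable // tokens_per_frame) / fps
```

A 10-minute clip at 1 FPS is 600 frames, which costs between 42,000 and 672,000 tokens. At the high-detail setting only about 230 seconds fit in the advertised context, and your runtime's real cap is probably lower.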

3. On-device (Mac and iPhone)

Google AI Edge Gallery on iOS and macOS for on-device Gemma family models.
Eloquent for voice edit with 12B on Mac (June 2026 launch week).
LiteRT-LM for cross-platform CLI import and litert-lm serve.

Hardware still matters: 12B on a phone is not guaranteed on every device.

4. Fine-tuning

Unsloth Gemma 4 train guide: unified LoRA touches vision, audio, and text in one pass.
Show HN: gemma-tuner-multimodal for Apple Silicon audio fine-tunes. Authors warn about OOM near ~2k tokens on 64 GB Macs.

Gemma 4 12B vs Gemma 3 12B (and vs E4B)

Question	Answer
`gemma 4 12b vs gemma 3 12b`	Gemma 4 adds audio at 12B, encoder-free fusion, 256K context, native `system` role, thinking + tool templates, Apache 2.0. Gemma 3 used separate encoders and older Gemma Terms.
`gemma 4 e4b vs gemma 3 12b`	E4B is the phone-class ~4.5B effective model with encoders; 12B Unified targets laptops with stronger coding and agent scores.
vs Qwen 3.5 9B (local coding)	Community leans Qwen for Pi-style coding on 16 GB; Gemma wins multimodal + translation breadth in many HN threads. Test both.
vs Gemma 4 31B / 26B A4B	Use larger Gemma 4 if you have RAM; 12B is the 16 GB sweet spot.

Gemma 4 12B local setup (June 2026)

Pick one path below. Default checkpoint: google/gemma-4-12B-it unless you host the base model yourself.

Path A: Ollama (fastest for most people)

Best when you want Gemma 4 12B running in Ollama with one command.

ollama pull gemma4:12b
ollama run gemma4:12b

Ollama `gemma4:12b` tag (~7.6 GB Q4_K_M). Screenshot from ollama.com, captured June 5, 2026. Disk size and tags may change.

Tag	Disk	Context	Modalities (Ollama page)
`gemma4:12b`	~7.6 GB	256K	Text, image
`gemma4:12b-it-q8_0`	~13 GB	256K	Text, image
`gemma4:12b-mlx`	~10 GB	128K	Text only (Mac)

Sampling (vendor readme): temperature 1.0, top_p 0.95, top_k 64.

Thinking in Ollama: put <|think|> at the start of the system prompt.

Vision prompt order: image before text.

Audio on 12B: not on Ollama’s 12b tag as of June 5. Use Path B or D for audio, or gemma4:e4b in Ollama for smaller audio-capable builds.
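If you script against Ollama instead of using the CLI, the settings above map onto one request body. The sketch below builds a payload for Ollama's `/api/chat` endpoint with the vendor sampling values, the `<|think|>` prefix, and an optional image. The system prompt text and the 8192 `num_ctx` default are our placeholders. In this API images travel in a separate `images` field as base64, so the image-before-text ordering is left to Ollama's template.

```python
import base64
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default local port


def build_chat_payload(prompt, image_bytes=None, thinking=False, num_ctx=8192):
    """Build a non-streaming chat request for the gemma4:12b tag."""
    system = "You are a helpful assistant."  # placeholder system prompt
    if thinking:
        # The article's tip: the think token goes at the start of the system prompt.
        system = "<|think|>" + system
    user = {"role": "user", "content": prompt}
    if image_bytes is not None:
        user["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    return {
        "model": "gemma4:12b",
        "messages": [{"role": "system", "content": system}, user],
        "stream": False,
        # Sampling values from the vendor readme quoted above.
        "options": {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "num_ctx": num_ctx},
    }


def chat(payload, url=OLLAMA_CHAT_URL, timeout=300):
    """Send the payload to a running Ollama server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]
```

With the server running, `chat(build_chat_payload("Describe this screenshot.", open("shot.png", "rb").read()))` returns the answer as a string; `shot.png` is a hypothetical file name.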

Path B: llama.cpp + Unsloth GGUF (control + vision)

Best when you care about Unsloth quant quality for Gemma 4 12B. Unsloth now ships QAT GGUFs alongside the original quants.

Download from unsloth/gemma-4-12b-it-GGUF or the QAT variants at unsloth/gemma-4-12B-it-qat-GGUF.
Grab mmproj-BF16.gguf (~175 MB) for vision.
Recommended quant: UD-Q4_K_XL (~7.37 GB) per Unsloth KL benchmarks.

./llama-server \
  --model gemma-4-12b-it-UD-Q4_K_XL.gguf \
  --mmproj mmproj-BF16.gguf \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --port 8001 \
  --chat-template-kwargs '{"enable_thinking":false}'

GGUF size reference (12B IT, weights only):

Quant	~Size
UD-IQ3_XXS	4.6 GB
Q4_K_M	7.1 GB
UD-Q4_K_XL	7.4 GB
Q5_K_M	8.4 GB
Q8_0	12.7 GB
BF16	23.8 GB
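One way to read the size table is to pick the largest quant that still leaves headroom on your card. This is a naive sketch: the sizes are copied from the table above, the ~175 MB projector comes from the download notes, and the KV cache reserve is a guess you supply. The 2 GB default is our placeholder, and it ignores runtime overhead entirely.

```python
# Weights-only sizes in GB, copied from the table above.
GGUF_SIZES_GB = {
    "UD-IQ3_XXS": 4.6,
    "Q4_K_M": 7.1,
    "UD-Q4_K_XL": 7.4,
    "Q5_K_M": 8.4,
    "Q8_0": 12.7,
    "BF16": 23.8,
}
MMPROJ_GB = 0.175  # mmproj-BF16.gguf, ~175 MB


def largest_quant_that_fits(vram_gb, kv_reserve_gb=2.0, with_vision=True):
    """Return the biggest quant whose weights fit after reserving KV cache space."""
    budget = vram_gb - kv_reserve_gb - (MMPROJ_GB if with_vision else 0.0)
    fitting = {name: size for name, size in GGUF_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None  # nothing fits; shrink the reserve or use a smaller model
    return max(fitting, key=fitting.get)
```

With these numbers an 8 GB card lands on UD-IQ3_XXS, a 12 GB card with a 4 GB reserve lands on UD-Q4_K_XL, and a 24 GB card with an 8 GB reserve lands on Q8_0. Raise the reserve if you plan on long contexts.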

Path C: MLX on Mac (`gemma 4 12b on mac`)

Asset	~Size	Notes
mlx-community/gemma-4-12B-it-4bit	~11 GB	Vision-capable MLX
`ollama run gemma4:12b-mlx`	~10 GB	Text-only, 128K

Unsloth ships an install script for MLX chat:

curl -fsSL https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/scripts/install_gemma4_mlx.sh | sh
source ~/.unsloth/unsloth_gemma4_mlx/bin/activate
python -m mlx_vlm.chat --model mlx-community/gemma-4-12B-it-4bit

M2/M3 16 GB: workable for Q4-class 12B with modest context. M2 8 GB: prefer gemma4:e4b or heavy quants.

Path D: Transformers (full multimodal + audio)

Best when you need the official template for audio URLs, video frames, and enable_thinking.

pip install -U "transformers>=5.10.1" torch accelerate torchvision librosa

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-12B-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID, dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this repo layout in five bullets."},
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False,
).to(model.device)

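outputs = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(outputs[0][prompt_len:], skip_special_tokens=True))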

For images, put {"type": "image", "url": "..."} before text. For audio, put {"type": "audio", "audio": "..."} after instruction text (per model card).
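The ordering rule is easy to get backwards, so here is a small sketch of the two message shapes. The `image` and `audio` entries are the ones quoted above; the `{"type": "text", "text": ...}` entry is the usual Transformers chat-template form. The helper names and the example.com URLs are hypothetical.

```python
def image_then_text(image_url, question):
    """User turn for vision: the image entry comes before the text."""
    return {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": question},
        ],
    }


def text_then_audio(instruction, audio_ref):
    """User turn for audio: the instruction text comes before the audio entry."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            {"type": "audio", "audio": audio_ref},
        ],
    }


# Hypothetical inputs, for illustration only.
vision_messages = [image_then_text("https://example.com/shot.png", "What does this dialog say?")]
audio_messages = [text_then_audio("Transcribe this clip.", "https://example.com/clip.wav")]
```

Either list can replace `messages` in the `apply_chat_template` call above.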

Path E: LiteRT-LM (`gemma 4 12b litertlm`)

Google’s path for OpenAI-compatible local API and Mac agent workflows:

litert-lm import --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm \
  gemma-4-12B-it.litertlm gemma4-12b
litert-lm serve

Wire 2026 agents to http://localhost:9379/v1 (verify port in your build):

Agent	Config hook
OpenCode	`baseURL` → LiteRT-LM or `http://localhost:11434/v1` for Ollama
Codex CLI	`[model_providers.local]` with `ollama` or custom `base_url`
Crush	OpenAI-compatible `base_url` in `crush.json`
Pi	`~/.pi/agent/models.json` → `http://localhost:11434/v1`

Not a native local target: Claude Code, Cursor (BYOK cloud APIs only), Antigravity and Antigravity CLI (cloud models; replaces deprecated Gemini CLI).
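Before wiring an agent, a quick smoke test against the endpoint saves debugging time. This sketch posts to the standard OpenAI-style `/chat/completions` route using only the standard library. The `gemma4-12b` model name assumes the alias from the import command above is what the server exposes, and the bearer token is a dummy value; verify both against your build.

```python
import json
import urllib.request


def build_request(base_url, model, prompt, max_tokens=256):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        # Local servers usually ignore the key; "local" is a dummy value.
        headers={"Content-Type": "application/json", "Authorization": "Bearer local"},
        method="POST",
    )


def complete(req, timeout=300):
    """Send the request and return the first choice's text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


# LiteRT-LM on the documented port; swap the arguments for Ollama.
litert_req = build_request("http://localhost:9379/v1", "gemma4-12b", "Say hello.")
ollama_req = build_request("http://localhost:11434/v1/", "gemma4:12b", "Say hello.")
```

Call `complete(litert_req)` once the server is up. If that returns text, the same base URL should work in the agent configs listed in the table.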

Path F: vLLM (production GPU)

Use the vLLM Gemma 4 12B recipe with the nightly / gemma4-unified images. Plan on 40 GB+ of VRAM for comfortable BF16 serving; for lower VRAM, use the QAT compressed-tensors checkpoint at google/gemma-4-12B-it-qat-w4a16-ct (152k+ downloads). SGLang also supports the QAT checkpoints for efficient serving.

vllm serve google/gemma-4-12B-it \
  --max-model-len 16384 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --limit-mm-per-prompt '{"image": 4, "audio": 1}'

VRAM and RAM planning

Google’s weight-only table (verify against your own quant):

Precision	Weight memory
Q4_0 (QAT)	~6.7 GB
8-bit	~13.4 GB
BF16	~26.7 GB

Real-world rule: add 2–8+ GB for KV cache depending on context. With Gemma’s large vocabulary and long contexts, total memory use can blow well past the weight size at 32K context (Ollama community reports).
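The arithmetic behind both numbers is short enough to sketch. Weight memory is parameters times bits per weight; GGUF Q4_0 and Q8_0 store one 16-bit scale per 32-weight block, which works out to 4.5 and 8.5 bits per weight. The KV cache formula is the standard one for a transformer, but the layer count, KV head count, and head size below are placeholders, not Gemma 4's real config, and the formula treats every layer as global attention, so it is an upper bound for a sliding-window model.

```python
PARAMS = 11.95e9  # parameter count from the spec table


def weight_gb(bits_per_weight, params=PARAMS):
    """Weights-only memory in GB for a given average bits per weight."""
    return params * bits_per_weight / 8 / 1e9


def kv_cache_gb(context_tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GB: keys plus values, for every layer and every token."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9


# Placeholder architecture, NOT read from the Gemma 4 config.
HYPOTHETICAL_ARCH = {"layers": 48, "kv_heads": 8, "head_dim": 128}

kv_32k_f16 = kv_cache_gb(32_768, **HYPOTHETICAL_ARCH)                   # f16 cache
kv_32k_q8 = kv_cache_gb(32_768, bytes_per_elem=1, **HYPOTHETICAL_ARCH)  # ~8-bit cache
```

`weight_gb(4.5)` gives about 6.72 GB and `weight_gb(8.5)` about 12.70 GB, which line up with the Q4_0 and Q8_0 file sizes quoted on this page. With the placeholder architecture, a 32K f16 cache comes to roughly 6.4 GB, and an 8-bit cache halves that.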

Hardware	Suggested quant	Realistic context
8 GB GPU	Q3 / UD-IQ3_XXS	4K–8K
12 GB GPU	Q4_K_M	8K–16K
16 GB GPU / RAM	Ollama `gemma4:12b`	16K–32K
24 GB GPU	Q8_0 or Q5_K_M	32K–64K
40 GB+	vLLM BF16	32K+ (raise carefully)

KV cache tip (Ollama): try OLLAMA_KV_CACHE_TYPE=q8_0 if context spikes VRAM. Ollama documents this setting as requiring Flash Attention (OLLAMA_FLASH_ATTENTION=1).

Gemma 4 QAT (Quantization-Aware Training) — June 5, 2026

Two days after the 12B launch, Google released QAT checkpoints for every Gemma 4 size. Instead of quantizing after training (PTQ), QAT simulates quantization during training itself. The result: compressed weights that hold more of the original model quality than standard PTQ quants.
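To make "simulates quantization during training" concrete, here is a toy fake-quantization pass in plain Python. In QAT-style training the forward pass typically runs on weights that were rounded to the low-bit grid and scaled back, while the optimizer keeps updating the full-precision copy. This is a generic symmetric 4-bit, per-block sketch, not Google's recipe and not the exact Q4_0 layout; the block size of 32 is our assumption.

```python
def fake_quantize_block(values, bits=4):
    """Quantize then dequantize one block using a single shared scale."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)      # all zeros: nothing to round
    scale = peak / qmax
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))  # snap to the integer grid
        out.append(q * scale)                       # map back to float
    return out


def fake_quantize(weights, block_size=32, bits=4):
    """Apply block-wise fake quantization to a flat list of weights."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(fake_quantize_block(weights[start:start + block_size], bits))
    return out
```

For the block `[1.75, -0.5, 0.3, 0.0]` the scale is 0.25, so 0.3 snaps to 0.25 and the other values survive unchanged. Training against that rounding error is what lets the final low-bit weights hold up better than quantizing once at the end.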

For 12B, you now have three new Hugging Face repos:

Repo	Format	~Downloads (Jun 5)	Use for
`google/gemma-4-12B-it-qat-q4_0-gguf`	GGUF Q4_0	52k+	llama.cpp, Ollama import
`google/gemma-4-12B-it-qat-w4a16-ct`	Compressed tensors	152k+	vLLM, SGLang
`google/gemma-4-12B-it-qat-q4_0-unquantized`	Unquantized BF16	4.5k+	Custom conversion to other formats

Unsloth also ships QAT GGUFs at unsloth/gemma-4-12B-it-qat-GGUF (121k+ downloads) with their UD quants. And Google published MTP QAT checkpoints so you keep the multi-token prediction speedup even with quantized weights.

Google's official QAT Q4_0 GGUF for 12B on Hugging Face. Screenshot from huggingface.co, captured June 5, 2026. Download counts change daily.

Which quant to pick now: the QAT Q4_0 at ~6.7 GB is roughly 0.4 GB smaller than the standard Q4_K_M and Google’s benchmarks show it holds quality closer to the original BF16 model. If you already pulled gemma4:12b in Ollama and it works, no need to switch immediately. But for fresh installs or vLLM serving, start with the QAT checkpoints. You can also run QAT models directly in the browser via Transformers.js.

Thinking mode, tools, and system prompts

Gemma 4 adds a native system role and structured function calling tokens. For agents:

Declare tools in apply_chat_template(..., tools=[...]).
Enable thinking when plans are hard: enable_thinking=True.
On the next user turn, drop thought channels from history, except between tool calls in one agent turn (thinking docs).
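The history rule in the last step is the part agents most often get wrong, so here is one way it could look. The sketch assumes your harness stores thoughts in a `thinking` key on assistant messages, which is a hypothetical layout; adapt it to however your runtime returns the thought channel. The tool schema is the usual JSON-schema form accepted by `apply_chat_template(..., tools=[...])`, and `read_file` is a made-up tool.

```python
# Hypothetical tool definition in JSON-schema form.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the repo.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Relative file path"}},
            "required": ["path"],
        },
    },
}


def prune_thoughts(messages):
    """Drop stored thoughts from finished turns, keep them in the current turn.

    Messages before the most recent user message belong to finished turns.
    Messages after it are the in-progress agent turn, where tool calls may
    still depend on the earlier reasoning.
    """
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"), default=-1
    )
    pruned = []
    for i, msg in enumerate(messages):
        if i < last_user and msg["role"] == "assistant" and "thinking" in msg:
            msg = {k: v for k, v in msg.items() if k != "thinking"}
        pruned.append(msg)
    return pruned
```

Call `prune_thoughts(history)` right before each request. The function returns new dicts for the messages it changes, so the stored history is not mutated.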

Multi-token prediction (MTP): optional drafter checkpoints for faster inference (MTP overview). Supported stacks include Ollama, MLX, vLLM, and LiteRT-LM. Google also released MTP QAT checkpoints, so you can combine faster decoding with the quality-preserving QAT quantization.

Coding agents in 2026 (what to pair with Gemma 4 12B)

If you are picking agents in mid-2026, split the problem in two: terminal CLIs that can hit a local OpenAI URL, and cloud IDE agents that want Anthropic, OpenAI, or Google keys. Gemma 12B only fits the first bucket.

What developers actually run (June 2026)

OpenCode terminal agent UI. Screenshot from opencode.ai, captured June 5, 2026. Use with Ollama `baseURL` for local Gemma.

Agent	Type	Local Gemma 4 12B?	Typical role
Claude Code	Terminal + IDE + desktop	No (cloud Anthropic; gateway only)	Daily agent work, MCP, subagents
Codex CLI	Terminal (OpenAI)	Yes: native `ollama` / `lmstudio` providers	`codex exec`, worktrees, automation
OpenCode	Open terminal + desktop	Yes: Ollama + any OpenAI-compatible URL	Free/open multi-provider agent
Crush	Terminal (Charm)	Yes: `base_url` in config	TUI coding agent, MCP, LSP-aware edits
Cursor	AI IDE + CLI	No (BYOK cloud keys only)	IDE Agent, Cloud Agent handoff
Antigravity + CLI	Google IDE + terminal	No (cloud models)	Consumer Gemini CLI moves to Antigravity CLI (Google blog); enterprise Gemini CLI continues
GitHub Copilot cloud agent	Cloud PR agent	No	Repo-wide tasks on GitHub infra
Pi	Minimal harness	Yes: `models.json`	Power users, extensions, local control
OpenClaw	Messaging orchestrator	Via wired backend	Delegates to Codex/Cursor/Claude from chat

Google’s launch blog still name-drops older OpenAI-shim tools. In June 2026 threads the local wiring is usually OpenCode, Codex CLI, or Crush.

Recommended stack: local Gemma + cloud frontier

Run weights: ollama pull gemma4:12b or litert-lm serve.
Local agent: ollama launch opencode or Pi/Codex pointed at http://localhost:11434/v1.
Hard tasks: same repo in Claude Code (Opus) or Codex (GPT-5.5 class) on cloud.
Google UI lane: Antigravity or Antigravity CLI. Google is retiring the consumer Gemini CLI in favor of Antigravity CLI; enterprise Gemini CLI continues. Local Gemma stays on Ollama/OpenCode.

OpenCode + Ollama (copy-paste pattern)

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": { "baseURL": "http://localhost:11434/v1" },
      "models": {
        "gemma4:12b": { "name": "Gemma 4 12B (local)" }
      }
    }
  }
}

Docs: Ollama + OpenCode. Start around 8k to 16k context on a 16 GB box. Raise num_ctx until VRAM complains; 64k is wishful on most laptops.

Codex CLI + Ollama (sketch)

In ~/.codex/config.toml (check the keys against the Codex advanced config docs):

[model_providers.local_ollama]
base_url = "http://localhost:11434/v1"

Then select gemma4:12b as the model for a sandboxed task. Codex is the OpenAI agent CLI. Useful when you already live in codex exec but want offline weights.

Crush + local OpenAI-compatible API

Point an OpenAI-compatible provider at Ollama or LiteRT-LM in crush.json (see Crush README configuration section). Same pattern as Open WebUI + local server setups.

Coding fit table (Gemma 4 12B itself)

Use case	Fit
Offline refactors via OpenCode/Crush	Good with Q4 + 8K–16K context
Same repo as Claude Code/Codex cloud	Hybrid: local for private files, cloud for ship
Tool-calling agents	Strong vendor Tau2 scores; test JSON schema in OpenCode
Repo-wide 128K reasoning	Possible in theory; watch VRAM on 16 GB
Antigravity / Cursor default	Use cloud models; Gemma is a parallel local lane

API and cloud routes (`gemma 4 12b api`)

Route	12B status (June 5, 2026)
Hugging Face Inference / Endpoints	`google/gemma-4-12B-it`
OpenRouter `google/gemma-4-12b-it`	Not listed; use `google/gemma-4-31b-it` if you need hosted Gemma 4 today
Google AI Studio / Gemini API	Family docs emphasize 26B A4B and 31B. Check Studio for 12B availability
Vertex Model Garden	Production Gemma 4 family
Local OpenAI shim	`litert-lm serve`, LM Studio server, `llama-server`

See our OpenRouter Free Models (2026) guide for router patterns; swap model IDs when 12B appears.

Troubleshooting

Symptom	Likely cause	Fix
CUDA OOM at modest context	KV + vision tokens	Lower `num_ctx`; Q4 quant; fewer image tokens
VRAM much larger than 7.6 GB	Embedding tables + f16 KV	Cap context; `OLLAMA_KV_CACHE_TYPE=q8_0`
Vision fails in llama.cpp	Missing mmproj	Add `mmproj-BF16.gguf`
`AutoModel` class errors	Old Transformers	`pip install -U "transformers>=5.10.1"`
No audio in Ollama 12b	Tag limitation	Transformers, LiteRT-LM, or vLLM
Empty thought tags with thinking off	12B template quirk	`enable_thinking=False` or strip in post-process
vLLM model not found	Needs unified nightly	Use gemma4-unified container per recipe

Who should use, watch, or skip

Audience	Verdict
Privacy-first dev with 16 GB GPU	Use: Ollama or UD-Q4_K_XL + mmproj
Multimodal agent on Mac	Use: LiteRT-LM or MLX 4-bit
Production API at scale	Watch: prefer 31B hosted or vLLM on 40 GB+ until 12B API listings stabilize
Best local coder only	Watch: benchmark against Qwen 3.5 9B on your prompts
Need guaranteed commercial audio rights for ads	Skip for client work until legal reviews Terms

Latest AI Models Compared (2026). Where Gemma 4 sits next to GPT-5.5 and DeepSeek V4
DeepSeek V4 vs ChatGPT vs Claude for Coding (2026). If coding is your main stack
Devin Desktop vs Cursor (2026). Picking an IDE agent lane beside local Gemma
OpenRouter Free Models (2026). Hosted Gemma 4 31B patterns until 12B routes ship

Changelog

2026-06-05: First publish. Local setup for Ollama, MLX, LiteRT-LM, and 2026 coding agents (OpenCode, Codex CLI, Crush). Consumer Gemini CLI moves to Antigravity CLI.
2026-06-08 (update): Added Gemma 4 QAT (Quantization-Aware Training) checkpoints — Q4_0 GGUF, vLLM compressed tensors, and Unsloth QAT GGUFs for 12B. MTP QAT checkpoints also available.

Frequently asked

What is Gemma 4 12B?

Gemma 4 12B Unified is Google DeepMind's encoder-free open model with about 11.95 billion parameters. It handles text, images, and audio in one decoder-only transformer, supports up to 256K context on the model card, and ships under Apache 2.0. Use the instruction-tuned checkpoint google/gemma-4-12B-it for chat and agents.

How much VRAM do I need for Gemma 4 12B locally?

A Q4_K_M GGUF or Ollama gemma4:12b build needs roughly 7.6 GB for weights plus a vision projector and extra memory for KV cache. Plan 16 GB unified memory or VRAM for comfortable 8K–16K context. Eight-gigabyte GPUs can run aggressive Q3 quants with short context only.

Does Ollama support Gemma 4 12B?

Yes. Run ollama pull gemma4:12b for a 7.6 GB Q4_K_M build with text and image input and 256K advertised context. Ollama's 12b tag does not list audio today; use Transformers, LiteRT-LM, or vLLM nightly for native 12B audio on your machine.

Is Gemma 4 12B good for coding?

It scores well on vendor coding benchmarks such as LiveCodeBench v6 and Codeforces ELO versus Gemma 3. Early community tests show workable local coding at about 5 tokens per second on a 12 GB GPU, but many devs still prefer Qwen 3.5 9B or larger Gemma 4 sizes for agent loops. Run your own repo prompts before you switch stacks.

How do I enable thinking mode on Gemma 4 12B?

Add the <|think|> token at the start of the system prompt, or pass enable_thinking=True to apply_chat_template in Transformers. The model emits a thought channel before the final answer. Strip thought blocks from chat history on the next turn, except during multi-step tool calls within one agent turn.

Gemma 4 12B vs Gemma 3 12B: which should I download?

Pick Gemma 4 12B if you want native audio at this size, encoder-free multimodal fusion, Apache 2.0 licensing, 256K context, and stronger math and coding scores on Google's published tables. Stay on Gemma 3 12B only if you already tuned a pipeline and do not need audio or the new chat template yet.

Can I use Gemma 4 12B on a Mac?

Yes. Ollama offers gemma4:12b-mlx for text at about 10 GB on Apple Silicon, or use mlx-community/gemma-4-12B-it-4bit for vision-capable MLX at about 11 GB. Google's AI Edge Gallery and LiteRT-LM also target Mac laptops with OpenAI-compatible local serve.

Is Gemma 4 12B on OpenRouter?

As of June 5, 2026, OpenRouter lists google/gemma-4-31b-it and free variants, not the 12B slug. For hosted API access to the 12B class, check Google AI Studio or Vertex Model Garden. For local agents, run Ollama or litert-lm serve and point OpenCode, Codex CLI, Crush, or Pi at the OpenAI-compatible base URL.

Which AI coding agents work with local Gemma 4 12B in 2026?

OpenCode, Pi, Codex CLI, and Crush support custom base URLs or native Ollama providers. Claude Code, Cursor, Google Antigravity, and GitHub Copilot cloud agent expect cloud models for full agent features. Consumer Gemini CLI is moving to Antigravity CLI; enterprise Gemini CLI continues. Pair Gemma locally with OpenCode or Crush for terminal agents. Keep Claude Code on Anthropic and Codex on OpenAI for hard production repos.

What is Gemma 4 QAT and where do I get the 12B checkpoints?

QAT (Quantization-Aware Training) checkpoints released June 5, 2026 simulate quantization during training, so compressed weights hold more quality than standard post-training quants. For 12B, grab the Q4_0 GGUF at google/gemma-4-12B-it-qat-q4_0-gguf (~6.7 GB), compressed tensors for vLLM at google/gemma-4-12B-it-qat-w4a16-ct, or Unsloth QAT GGUFs at unsloth/gemma-4-12B-it-qat-GGUF. MTP QAT checkpoints are also available.