AI Tools Radar

Gemma 4 12B: Download, Ollama, GGUF & QAT Setup Guide

Download and run Gemma 4 12B locally with Ollama, GGUF (Q4_K_M), and QAT quantization. Includes VRAM requirements, hardware table, and setup guide.

AI Tools Radar Editorial 16 min read

Gemma 4 12B Unified landed on June 3, 2026. One weights file handles text, images, and audio without bolting on separate vision and audio encoders like the smaller E2B/E4B builds. With 11.95B parameters, an Apache 2.0 license, and a Q4 Ollama pull of around 7.6 GB, it is the model people actually try on a 16 GB laptop. On June 5, Google shipped Quantization-Aware Training (QAT) checkpoints for the entire Gemma 4 family; the 12B Q4_0 GGUF now weighs about 6.7 GB and holds quality better than standard post-training quantization (PTQ). Setup paths below include the new QAT repos.

This page is install notes: what changed from Gemma 3, how folks wire it into OpenCode, Codex CLI, and Crush, and where it still loses to cloud agents like Claude Code. For the wider 2026 model map, see Latest AI Models Compared (2026). Install paths start in Gemma 4 12B local setup.

Last updated: June 8, 2026. Live on aitoolsradar.org.

Quick specs

| Spec | Gemma 4 12B Unified (google/gemma-4-12B-it) |
| --- | --- |
| Parameters | 11.95B (dense, encoder-free) |
| Context | 256K tokens (model card); many runtimes cap lower |
| Modalities | Text, image, audio; video via frame sequences |
| License | Apache 2.0 (Gemma 4 license) |
| Release | 2026-06-03 (12B); Gemma 4 family from 2026-03-31 |
| Ollama disk (Q4_K_M) | ~7.6 GB (gemma4:12b) |
| QAT Q4_0 disk | ~6.7 GB weights (google/gemma-4-12B-it-qat-q4_0-gguf) |
| Best for | Local multimodal agents, OCR-style image Q&A, ASR, privacy-sensitive drafts |
| Watch out for | KV cache VRAM spikes; Ollama 12b without audio; 12B not on OpenRouter yet |

Hugging Face card for `google/gemma-4-12B-it`. Screenshot from huggingface.co, captured June 5, 2026. Tags and download counts change daily.

What makes Gemma 4 12B different

Google calls this variant Unified because it removed the standalone vision and audio encoders used on Gemma 4 E2B, E4B, and 31B. Instead:

  • Vision: 48×48 pixel patches project through a small embedder (~35M params) into the decoder.
  • Audio: 16 kHz waveforms become 40 ms frames, then a linear projection into the same embedding space.
  • Text: Standard decoder-only transformer with hybrid attention (sliding window + global layers; final layer always global).

You skip the extra 150M to 550M vision tower and 300M audio encoder files that other Gemma 4 sizes carry. One download, one runtime.

Unified does not mean light on RAM. KV cache, projector weights, and image token budgets (70 to 1120 tokens per image on the card) still eat VRAM.
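To put rough numbers on those input budgets, here is a minimal sketch that uses only the figures quoted above: 16 kHz audio, 40 ms frames, and 70 to 1120 tokens per image. Whether each audio frame maps to exactly one token is an assumption and is not confirmed by the text, so treat the frame count as an upper-level estimate rather than an exact token count.

```python
SAMPLE_RATE_HZ = 16_000   # from the text: 16 kHz waveforms
FRAME_MS = 40             # from the text: 40 ms frames
IMAGE_TOKENS_MIN = 70     # per-image budget range quoted from the model card
IMAGE_TOKENS_MAX = 1120


def audio_frames(duration_s: float) -> int:
    """Number of whole 40 ms frames in a clip of the given length."""
    samples = int(duration_s * SAMPLE_RATE_HZ)
    samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 640 samples
    return samples // samples_per_frame


def image_token_range(n_images: int) -> tuple[int, int]:
    """Lowest and highest total image-token cost for n_images."""
    return n_images * IMAGE_TOKENS_MIN, n_images * IMAGE_TOKENS_MAX


if __name__ == "__main__":
    print(audio_frames(60))       # frames in a one-minute clip
    print(image_token_range(4))   # token range for four images
```

A one-minute clip works out to 25 frames per second, so long audio and high-resolution images both compete with text for the same context window.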

Benchmarks people cite (instruction-tuned, vendor table)

Google published these instruction-tuned numbers on the model card. We did not rerun the suites. Good for ballpark comparisons, not for picking a winner on your repo.

| Benchmark | Gemma 4 12B | Gemma 3 27B (no think) |
| --- | --- | --- |
| MMLU Pro | 77.2% | 67.6% |
| AIME 2026 (no tools) | 77.5% | 20.8% |
| LiveCodeBench v6 | 72.0% | 29.1% |
| Codeforces ELO | 1659 | 110 |
| GPQA Diamond | 78.8% | 42.4% |
| Tau2 (agent avg) | 69.0% | 16.2% |
| MMMU Pro (vision) | 69.1% | 49.7% |
| MRCR v2 @ 128k | 43.4% | 13.5% |

For coding, stare at LiveCodeBench and Codeforces. For screenshots and PDFs rendered as images, add MMMU Pro. The table compares Gemma 4 12B to Gemma 3 27B, not Gemma 3 12B.

How people use it (launch week, June 2026)

From HN, Reddit, and a few local runs we trust:

1. Local coding and 2026 agent CLIs

  • llama.cpp / Ollama + Q4: Developers pull gemma4:12b and run vibe-coding tests. Senko’s minesweeper bench (linked from HN) reported ~5 tok/s on a 12 GB GPU with fixable syntax mistakes, not frontier-cloud quality, but usable offline.
  • Terminal agents (June 2026): Builders wire the same local endpoint into OpenCode, Codex CLI (ollama / lmstudio providers), Crush, or Pi via baseURL in config, not legacy IDE-only chat extensions.
  • Cloud agents (different lane): Claude Code, Cursor Agent, Google Antigravity, and GitHub Copilot cloud agent are common for paid daily coding in 2026. They do not natively target Ollama; you keep them on Anthropic/OpenAI/Google models and use Gemma 12B for private local passes.
  • LiteRT-LM serve: Google documents port 9379 as an OpenAI-compatible shim for local Gemma (Edge blog). Point any agent that accepts a custom baseURL at that server.
  • Hybrid stacks: HN commenters often pair Gemma 4 for multimodal with Qwen 3.5 9B for coding on the same 16 GB box, or run Codex/Claude Code in the cloud and OpenCode + Ollama offline on the same repo.

Worth the disk space if you want offline screenshots, audio clips, or drafts you will not send to a cloud API. It will not replace Claude Code or Codex on a nasty refactor. For terminal work, OpenCode or Crush pointed at Ollama is the usual pattern.

2. Multimodal workflows (OCR, PDF, screenshots, video, ASR)

| Workflow | How builders do it |
| --- | --- |
| Screenshot / UI Q&A | Feed PNGs in chat template; no separate OCR API |
| PDF | Render pages to images, or use Pi liteparse skills (Patrick Loeber) |
| Video | Sample frames (e.g. 1 FPS); Google demoed keynote-length clips |
| ASR / translation | Native audio tokens on 12B; verify language support on your clip |
| Charts from CSV | Google AI Edge Gallery runs Python in a sandbox on Mac |
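For the video row, frame sampling is simple index arithmetic. The sketch below picks which source frames to keep at 1 FPS and estimates the image-token cost using the 70 to 1120 tokens-per-image range from the model card. The function names are illustrative, and the per-frame budget you pass in is your own choice, not a value the runtime reports.

```python
def sample_frame_indices(duration_s: float, source_fps: float,
                         sample_fps: float = 1.0) -> list[int]:
    """Indices of source frames to keep when sampling at sample_fps."""
    if source_fps <= 0 or sample_fps <= 0:
        raise ValueError("frame rates must be positive")
    total_frames = int(duration_s * source_fps)
    n_samples = int(duration_s * sample_fps)
    step = source_fps / sample_fps
    indices = [int(i * step) for i in range(n_samples)]
    # Guard against rounding past the end of the clip.
    return [i for i in indices if i < total_frames]


def frame_token_cost(n_frames: int, tokens_per_frame: int) -> int:
    """Total image tokens if every sampled frame uses the same budget."""
    return n_frames * tokens_per_frame


if __name__ == "__main__":
    frames = sample_frame_indices(duration_s=600, source_fps=30)
    print(len(frames), frame_token_cost(len(frames), 70))
```

Even at the lowest per-image budget, a ten-minute clip at 1 FPS costs tens of thousands of tokens, which is why long videos usually need a lower sampling rate on 16 GB hardware.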

3. On-device (Mac and iPhone)

  • Google AI Edge Gallery on iOS and macOS for on-device Gemma family models.
  • Eloquent for voice edit with 12B on Mac (June 2026 launch week).
  • LiteRT-LM for cross-platform CLI import and litert-lm serve.

Hardware still matters: 12B on a phone is not guaranteed on every device.

4. Fine-tuning

Gemma 4 12B vs Gemma 3 12B (and vs E4B)

| Question | Answer |
| --- | --- |
| Gemma 4 12B vs Gemma 3 12B | Gemma 4 adds audio at 12B, encoder-free fusion, 256K context, native system role, thinking + tool templates, Apache 2.0. Gemma 3 used separate encoders and older Gemma Terms. |
| Gemma 4 E4B vs Gemma 3 12B | E4B is the phone-class ~4.5B effective model with encoders; 12B Unified targets laptops with stronger coding and agent scores. |
| vs Qwen 3.5 9B (local coding) | Community leans Qwen for Pi-style coding on 16 GB; Gemma wins multimodal + translation breadth in many HN threads. Test both. |
| vs Gemma 4 31B / 26B A4B | Use larger Gemma 4 if you have RAM; 12B is the 16 GB sweet spot. |

Gemma 4 12B local setup (June 2026)

Pick one path below. Default checkpoint: google/gemma-4-12B-it unless you host the base model yourself.

Path A: Ollama (fastest for most people)

Best when you want Gemma 4 12B running in Ollama with one command.

ollama pull gemma4:12b
ollama run gemma4:12b

Ollama `gemma4:12b` tag (~7.6 GB Q4_K_M). Screenshot from ollama.com, captured June 5, 2026. Disk size and tags may change.

| Tag | Disk | Context | Modalities (Ollama page) |
| --- | --- | --- | --- |
| gemma4:12b | ~7.6 GB | 256K | Text, image |
| gemma4:12b-it-q8_0 | ~13 GB | 256K | Text, image |
| gemma4:12b-mlx | ~10 GB | 128K | Text only (Mac) |

Sampling (vendor readme): temperature 1.0, top_p 0.95, top_k 64.

Thinking in Ollama: put <|think|> at the start of the system prompt.

Vision prompt order: image before text.

Audio on 12B: not on Ollama’s 12b tag as of June 5. Use Path D, E, or F (Transformers, LiteRT-LM, or vLLM) for audio, or gemma4:e4b in Ollama for smaller audio-capable builds.

Path B: llama.cpp + Unsloth GGUF (control + vision)

Best when you care about Unsloth quant quality for Gemma 4 12B. Unsloth now ships QAT GGUFs alongside the original quants.

  1. Download from unsloth/gemma-4-12b-it-GGUF or the QAT variants at unsloth/gemma-4-12B-it-qat-GGUF.
  2. Grab mmproj-BF16.gguf (~175 MB) for vision.
  3. Recommended quant: UD-Q4_K_XL (~7.37 GB) per Unsloth KL benchmarks.
./llama-server \
  --model gemma-4-12b-it-UD-Q4_K_XL.gguf \
  --mmproj mmproj-BF16.gguf \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --port 8001 \
  --chat-template-kwargs '{"enable_thinking":false}'

GGUF size reference (12B IT, weights only):

| Quant | ~Size |
| --- | --- |
| UD-IQ3_XXS | 4.6 GB |
| Q4_K_M | 7.1 GB |
| UD-Q4_K_XL | 7.4 GB |
| Q5_K_M | 8.4 GB |
| Q8_0 | 12.7 GB |
| BF16 | 23.8 GB |
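As a sanity check on file sizes, weight size is roughly parameter count times bits per weight. The sketch below uses the 11.95B parameter count from the spec table and the standard GGUF block layouts for Q4_0 (32 weights in 18 bytes) and Q8_0 (32 weights in 34 bytes). K-quants and Unsloth UD quants mix several block types per tensor, so they are left out; this is an estimate, not a replacement for checking the actual download.

```python
PARAMS = 11.95e9  # parameter count from the spec table

# Effective bits per weight, including the per-block scale.
BITS_PER_WEIGHT = {
    "Q4_0": 4.5,   # 18 bytes per block of 32 weights
    "Q8_0": 8.5,   # 34 bytes per block of 32 weights
    "BF16": 16.0,  # two bytes per weight
}


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: {weight_gb(PARAMS, bits):.1f} GB")
```

The estimates land close to the Q8_0 and BF16 rows above and to the ~6.7 GB quoted for the QAT Q4_0 file, which suggests those figures are weights only and exclude KV cache.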

Path C: MLX on Mac (gemma 4 12b on mac)

| Asset | ~Size | Notes |
| --- | --- | --- |
| mlx-community/gemma-4-12B-it-4bit | ~11 GB | Vision-capable MLX |
| ollama run gemma4:12b-mlx | ~10 GB | Text-only, 128K |

Unsloth ships an install script for MLX chat:

curl -fsSL https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/scripts/install_gemma4_mlx.sh | sh
source ~/.unsloth/unsloth_gemma4_mlx/bin/activate
python -m mlx_vlm.chat --model mlx-community/gemma-4-12B-it-4bit

M2/M3 16 GB: workable for Q4-class 12B with modest context. M2 8 GB: prefer gemma4:e4b or heavy quants.

Path D: Transformers (full multimodal + audio)

Best when you need the official template for audio URLs, video frames, and enable_thinking.

pip install -U "transformers>=5.10.1" torch accelerate torchvision librosa

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-12B-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID, dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this repo layout in five bullets."},
]

# Render the chat template and tokenize in one step.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(outputs[0][prompt_len:], skip_special_tokens=True))

For images, put {"type": "image", "url": "..."} before text. For audio, put {"type": "audio", "audio": "..."} after instruction text (per model card).
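The ordering rule is easy to get backwards, so here is a small sketch of two message builders that follow it. The content keys come from the sentence above; the helper names and the example URLs are hypothetical.

```python
def image_message(image_url: str, question: str) -> dict:
    """User turn with the image part placed before the text part."""
    return {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": question},
        ],
    }


def audio_message(instruction: str, audio_url: str) -> dict:
    """User turn with the instruction text placed before the audio part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            {"type": "audio", "audio": audio_url},
        ],
    }


if __name__ == "__main__":
    # Placeholder URLs; swap in your own files.
    print(image_message("https://example.com/screenshot.png", "What does this dialog say?"))
    print(audio_message("Transcribe this clip.", "https://example.com/clip.wav"))
```

Either dict can replace the plain-text user turn in the `messages` list passed to `apply_chat_template`.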

Path E: LiteRT-LM (gemma 4 12b litertlm)

Google’s path for OpenAI-compatible local API and Mac agent workflows:

litert-lm import --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm \
  gemma-4-12B-it.litertlm gemma4-12b
litert-lm serve

Wire 2026 agents to http://localhost:9379/v1 (verify port in your build):

| Agent | Config hook |
| --- | --- |
| OpenCode | baseURL → LiteRT-LM or http://localhost:11434/v1 for Ollama |
| Codex CLI | [model_providers.local] with ollama or custom base_url |
| Crush | OpenAI-compatible base_url in crush.json |
| Pi | ~/.pi/agent/models.json → http://localhost:11434/v1 |

Not a native local target: Claude Code, Cursor (BYOK cloud APIs only), Antigravity and Antigravity CLI (cloud models; replaces deprecated Gemini CLI).
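Before wiring an agent, it can help to smoke-test the endpoint by hand. The sketch below builds a standard OpenAI-style chat completions request with only the standard library. The two base URLs come from this guide; the LiteRT-LM model name `gemma4-12b` is assumed from the import alias above, and the `api_key` value is a dummy because local servers typically ignore it.

```python
import json
import urllib.request

LITERT_BASE_URL = "http://localhost:9379/v1"    # verify the port in your build
OLLAMA_BASE_URL = "http://localhost:11434/v1"


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "local") -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /chat/completions route."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request to a running local server and return the reply text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat(OLLAMA_BASE_URL, "gemma4:12b", "Reply with the single word: ready"))
```

If this call works, any agent that accepts a custom base URL should be able to reach the same server with the same model name.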

Path F: vLLM (production GPU)

Use the vLLM Gemma 4 12B recipe with nightly / gemma4-unified images. Plan 40 GB+ VRAM for comfortable BF16 serving; for lower VRAM, use the QAT compressed-tensors checkpoint at google/gemma-4-12B-it-qat-w4a16-ct (152k+ downloads). SGLang also supports the QAT checkpoints for efficient serving.

vllm serve google/gemma-4-12B-it \
  --max-model-len 16384 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --limit-mm-per-prompt '{"image": 4, "audio": 1}'
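With `--enable-auto-tool-choice` set, the server accepts OpenAI-style `tools` in the request body. Below is a hedged sketch of a request payload and a parser for the response. The `read_file` tool is hypothetical and exists only to show the schema shape; vLLM listens on port 8000 unless you pass `--port`.

```python
import json

VLLM_CHAT_URL = "http://localhost:8000/v1/chat/completions"  # vLLM default port


def build_tool_request(prompt: str) -> dict:
    """Chat completions payload that offers one (hypothetical) tool."""
    return {
        "model": "google/gemma-4-12B-it",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "read_file",  # hypothetical tool name
                "description": "Read a UTF-8 text file from the repo.",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }],
        "tool_choice": "auto",
        "max_tokens": 512,
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (tool name, parsed arguments) pairs from a chat response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In the OpenAI format, arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


if __name__ == "__main__":
    print(json.dumps(build_tool_request("Open README.md"), indent=2))
```

Validating the parsed arguments against your own schema before executing anything is worth the extra lines, since a 12B model can still emit malformed calls.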

VRAM and RAM planning

Google’s weight-only table (verify against your own quant):

| Precision | Weight memory |
| --- | --- |
| Q4_0 (QAT) | ~6.7 GB |
| 8-bit | ~13.4 GB |
| BF16 | ~26.7 GB |

Real-world rule: add 2–8+ GB for KV cache depending on context. Gemma models carry large vocabularies, and total memory use can climb well past the weight size at 32K context (Ollama community reports).

| Hardware | Suggested quant | Realistic context |
| --- | --- | --- |
| 8 GB GPU | Q3 / UD-IQ3_XXS | 4K–8K |
| 12 GB GPU | Q4_K_M | 8K–16K |
| 16 GB GPU / RAM | Ollama gemma4:12b | 16K–32K |
| 24 GB GPU | Q8_0 or Q5_K_M | 32K–64K |
| 40 GB+ | vLLM BF16 | 32K+ (raise carefully) |

KV cache tip (Ollama): try OLLAMA_KV_CACHE_TYPE=q8_0 if context spikes VRAM.
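To see why context length and KV cache type matter, here is a generic KV cache estimate for a model that mixes global and sliding-window attention layers. Every architecture number in the example call is a placeholder, not Gemma 4 12B's real configuration; read the actual layer counts, window size, and head dimensions from the model's config before trusting the output.

```python
def kv_cache_gb(context: int, n_global_layers: int, n_sliding_layers: int,
                window: int, n_kv_heads: int, head_dim: int,
                bytes_per_elem: float = 2.0) -> float:
    """Approximate KV cache size in decimal GB for one sequence."""
    # Each cached token stores one key and one value vector per KV head.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    # Global layers cache the whole context; sliding layers cap at the window.
    cached_tokens = (n_global_layers * context
                     + n_sliding_layers * min(context, window))
    return cached_tokens * per_token_per_layer / 1e9


if __name__ == "__main__":
    # Hypothetical architecture, for illustration only.
    arch = dict(n_global_layers=8, n_sliding_layers=40, window=1024,
                n_kv_heads=8, head_dim=256)
    for ctx in (8_192, 32_768, 131_072):
        f16 = kv_cache_gb(ctx, **arch)                      # 2 bytes per element
        q8 = kv_cache_gb(ctx, **arch, bytes_per_elem=1.0)   # roughly 8-bit cache
        print(f"{ctx:>7} tokens: f16 ~{f16:.2f} GB, 8-bit ~{q8:.2f} GB")
```

The shape of the result is the useful part: global layers grow linearly with context, sliding layers flatten out past the window, and an 8-bit cache roughly halves the total.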

Gemma 4 QAT (Quantization-Aware Training) — June 5, 2026

Two days after the 12B launch, Google released QAT checkpoints for every Gemma 4 size. Instead of quantizing after training (PTQ), QAT simulates quantization during training itself. The result: compressed weights that hold more of the original model quality than standard PTQ quants.

For 12B, you now have three new Hugging Face repos:

| Repo | Format | ~Downloads (Jun 5) | Use for |
| --- | --- | --- | --- |
| google/gemma-4-12B-it-qat-q4_0-gguf | GGUF Q4_0 | 52k+ | llama.cpp, Ollama import |
| google/gemma-4-12B-it-qat-w4a16-ct | Compressed tensors | 152k+ | vLLM, SGLang |
| google/gemma-4-12B-it-qat-q4_0-unquantized | Unquantized BF16 | 4.5k+ | Custom conversion to other formats |

Unsloth also ships QAT GGUFs at unsloth/gemma-4-12B-it-qat-GGUF (121k+ downloads) with their UD quants. And Google published MTP QAT checkpoints so you keep the multi-token prediction speedup even with quantized weights.

Google's official QAT Q4_0 GGUF for 12B on Hugging Face. Screenshot from huggingface.co, captured 2026-06-05. Download counts change daily.

Which quant to pick now: the QAT Q4_0 at ~6.7 GB is roughly 0.4 GB smaller than the standard Q4_K_M and Google’s benchmarks show it holds quality closer to the original BF16 model. If you already pulled gemma4:12b in Ollama and it works, no need to switch immediately. But for fresh installs or vLLM serving, start with the QAT checkpoints. You can also run QAT models directly in the browser via Transformers.js.

Thinking mode, tools, and system prompts

Gemma 4 adds a native system role and structured function calling tokens. For agents:

  1. Declare tools in apply_chat_template(..., tools=[...]).
  2. Enable thinking when plans are hard: enable_thinking=True.
  3. On the next user turn, drop thought channels from history, except between tool calls in one agent turn (thinking docs).
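Step 3 can be sketched as a small history filter. The `<thought>...</thought>` delimiters below are hypothetical stand-ins, since this guide does not show the real thought-channel markers; check the chat template or the thinking docs and pass the right pattern. The rule implemented here is one reading of the guidance: strip thoughts from assistant messages that precede the latest user message, and leave the current agent turn (including its tool calls) untouched.

```python
import re

# Hypothetical delimiters; replace with the markers your chat template emits.
THOUGHT_RE = re.compile(r"<thought>.*?</thought>\s*", re.DOTALL)


def prune_thoughts(history: list[dict], pattern: re.Pattern = THOUGHT_RE) -> list[dict]:
    """Return a copy of history with thoughts removed from finished turns."""
    last_user = max(
        (i for i, msg in enumerate(history) if msg["role"] == "user"),
        default=-1,
    )
    pruned = []
    for i, msg in enumerate(history):
        content = msg.get("content")
        if msg["role"] == "assistant" and i < last_user and isinstance(content, str):
            # Finished turn: keep the answer, drop the thought block.
            msg = {**msg, "content": pattern.sub("", content)}
        pruned.append(msg)
    return pruned


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "Rename foo to bar."},
        {"role": "assistant", "content": "<thought>Find usages first.</thought>Done."},
        {"role": "user", "content": "Now run the tests."},
    ]
    print(prune_thoughts(history))
```

The function returns new dicts rather than editing in place, so the full transcript stays available for logging.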

Multi-token prediction (MTP): optional drafter checkpoints for faster inference (MTP overview). Supported stacks include Ollama, MLX, vLLM, and LiteRT-LM. Google also released MTP QAT checkpoints, so you can combine faster decoding with the quality-preserving QAT quantization.

Coding agents in 2026 (what to pair with Gemma 4 12B)

If you are picking agents in mid-2026, split the problem in two: terminal CLIs that can hit a local OpenAI URL, and cloud IDE agents that want Anthropic, OpenAI, or Google keys. Gemma 12B only fits the first bucket.

What developers actually run (June 2026)

OpenCode terminal agent UI. Screenshot from opencode.ai, captured June 5, 2026. Use with Ollama `baseURL` for local Gemma.

| Agent | Type | Local Gemma 4 12B? | Typical role |
| --- | --- | --- | --- |
| Claude Code | Terminal + IDE + desktop | No (cloud Anthropic; gateway only) | Daily agent work, MCP, subagents |
| Codex CLI | Terminal (OpenAI) | Yes. native ollama / lmstudio providers | codex exec, worktrees, automation |
| OpenCode | Open terminal + desktop | Yes. Ollama + any OpenAI-compatible URL | Free/open multi-provider agent |
| Crush | Terminal (Charm) | Yes. base_url in config | TUI coding agent, MCP, LSP-aware edits |
| Cursor | AI IDE + CLI | No (BYOK cloud keys only) | IDE Agent, Cloud Agent handoff |
| Antigravity + CLI | Google IDE + terminal | No (cloud models) | Consumer Gemini CLI moves to Antigravity CLI (Google blog); enterprise Gemini CLI continues |
| GitHub Copilot cloud agent | Cloud PR agent | No | Repo-wide tasks on GitHub infra |
| Pi | Minimal harness | Yes. models.json | Power users, extensions, local control |
| OpenClaw | Messaging orchestrator | Via wired backend | Delegates to Codex/Cursor/Claude from chat |

Google’s launch blog still name-drops older OpenAI-shim tools. In June 2026 threads the local wiring is usually OpenCode, Codex CLI, or Crush. A typical split looks like this:

  1. Run weights: ollama pull gemma4:12b or litert-lm serve.
  2. Local agent: ollama launch opencode or Pi/Codex pointed at http://localhost:11434/v1.
  3. Hard tasks: same repo in Claude Code (Opus) or Codex (GPT-5.5 class) on cloud.
  4. Google UI lane: Antigravity or Antigravity CLI. Google is retiring the consumer Gemini CLI in favor of Antigravity CLI; enterprise Gemini CLI continues. Local Gemma stays on Ollama/OpenCode.

OpenCode + Ollama (copy-paste pattern)

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": { "baseURL": "http://localhost:11434/v1" },
      "models": {
        "gemma4:12b": { "name": "Gemma 4 12B (local)" }
      }
    }
  }
}

Docs: Ollama + OpenCode. Start around 8k to 16k context on a 16 GB box. Raise num_ctx until VRAM complains; 64k is wishful on most laptops.

Codex CLI + Ollama (sketch)

In ~/.codex/config.toml (check against the Codex advanced config docs before relying on it):

[model_providers.local_ollama]
base_url = "http://localhost:11434/v1"

Then select gemma4:12b as the model for a sandboxed task. Codex is the OpenAI agent CLI. Useful when you already live in codex exec but want offline weights.

Crush + local OpenAI-compatible API

Point an OpenAI-compatible provider at Ollama or LiteRT-LM in crush.json (see Crush README configuration section). Same pattern as Open WebUI + local server setups.

Coding fit table (Gemma 4 12B itself)

| Use case | Fit |
| --- | --- |
| Offline refactors via OpenCode/Crush | Good with Q4 + 8K–16K context |
| Same repo as Claude Code/Codex cloud | Hybrid: local for private files, cloud for ship |
| Tool-calling agents | Strong vendor Tau2 scores; test JSON schema in OpenCode |
| Repo-wide 128K reasoning | Possible in theory; watch VRAM on 16 GB |
| Antigravity / Cursor default | Use cloud models; Gemma is a parallel local lane |

API and cloud routes (gemma 4 12b api)

| Route | 12B status (June 5, 2026) |
| --- | --- |
| Hugging Face Inference / Endpoints | google/gemma-4-12B-it |
| OpenRouter google/gemma-4-12b-it | Not listed; use google/gemma-4-31b-it if you need hosted Gemma 4 today |
| Google AI Studio / Gemini API | Family docs emphasize 26B A4B and 31B. Check Studio for 12B availability |
| Vertex Model Garden | Production Gemma 4 family |
| Local OpenAI shim | litert-lm serve, LM Studio server, llama-server |

See our OpenRouter Free Models (2026) guide for router patterns; swap model IDs when 12B appears.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| CUDA OOM at modest context | KV + vision tokens | Lower num_ctx; Q4 quant; fewer image tokens |
| VRAM much larger than 7.6 GB | Embedding tables + f16 KV | Cap context; OLLAMA_KV_CACHE_TYPE=q8_0 |
| Vision fails in llama.cpp | Missing mmproj | Add mmproj-BF16.gguf |
| AutoModel class errors | Old Transformers | pip install -U "transformers>=5.10.1" (keep the quotes so the shell does not treat >= as a redirect) |
| No audio in Ollama 12b | Tag limitation | Transformers, LiteRT-LM, or vLLM |
| Empty thought tags with thinking off | 12B template quirk | enable_thinking=False or strip in post-process |
| vLLM model not found | Needs unified nightly | Use gemma4-unified container per recipe |

Who should use, watch, or skip

| Audience | Verdict |
| --- | --- |
| Privacy-first dev with 16 GB GPU | Use: Ollama or UD-Q4_K_XL + mmproj |
| Multimodal agent on Mac | Use: LiteRT-LM or MLX 4-bit |
| Production API at scale | Watch: prefer 31B hosted or vLLM on 40 GB+ until 12B API listings stabilize |
| Best local coder only | Watch: benchmark against Qwen 3.5 9B on your prompts |
| Need guaranteed commercial audio rights for ads | Skip for client work until legal reviews Terms |

Changelog

  • 2026-06-05: First publish. Local setup for Ollama, MLX, LiteRT-LM, and 2026 coding agents (OpenCode, Codex CLI, Crush). Consumer Gemini CLI moves to Antigravity CLI.
  • 2026-06-08 (update): Added Gemma 4 QAT (Quantization-Aware Training) checkpoints — Q4_0 GGUF, vLLM compressed tensors, and Unsloth QAT GGUFs for 12B. MTP QAT checkpoints also available.

Frequently asked

What is Gemma 4 12B?

Gemma 4 12B Unified is Google DeepMind's encoder-free open model with about 11.95 billion parameters. It handles text, images, and audio in one decoder-only transformer, supports up to 256K context on the model card, and ships under Apache 2.0. Use the instruction-tuned checkpoint google/gemma-4-12B-it for chat and agents.

How much VRAM do I need for Gemma 4 12B locally?

A Q4_K_M GGUF or Ollama gemma4:12b build needs roughly 7.6 GB for weights plus a vision projector and extra memory for KV cache. Plan 16 GB unified memory or VRAM for comfortable 8K–16K context. Eight-gigabyte GPUs can run aggressive Q3 quants with short context only.

Does Ollama support Gemma 4 12B?

Yes. Run ollama pull gemma4:12b for a 7.6 GB Q4_K_M build with text and image input and 256K advertised context. Ollama's 12b tag does not list audio today; use Transformers, LiteRT-LM, or vLLM nightly for native 12B audio on your machine.

Is Gemma 4 12B good for coding?

It scores well on vendor coding benchmarks such as LiveCodeBench v6 and Codeforces ELO versus Gemma 3. Early community tests show workable local coding at about 5 tokens per second on a 12 GB GPU, but many devs still prefer Qwen 3.5 9B or larger Gemma 4 sizes for agent loops. Run your own repo prompts before you switch stacks.

How do I enable thinking mode on Gemma 4 12B?

Add the think token at the start of the system prompt or pass enable_thinking=True in Transformers apply_chat_template. The model emits a thought channel before the final answer. Strip thought blocks from chat history on the next turn except during multi-step tool calls in one agent turn.

Gemma 4 12B vs Gemma 3 12B: which should I download?

Pick Gemma 4 12B if you want native audio at this size, encoder-free multimodal fusion, Apache 2.0 licensing, 256K context, and stronger math and coding scores on Google's published tables. Stay on Gemma 3 12B only if you already tuned a pipeline and do not need audio or the new chat template yet.

Can I use Gemma 4 12B on a Mac?

Yes. Ollama offers gemma4:12b-mlx for text at about 10 GB on Apple Silicon, or use mlx-community/gemma-4-12B-it-4bit for vision-capable MLX at about 11 GB. Google's AI Edge Gallery and LiteRT-LM also target Mac laptops with OpenAI-compatible local serve.

Is Gemma 4 12B on OpenRouter?

As of June 5, 2026, OpenRouter lists google/gemma-4-31b-it and free variants, not the 12B slug. For hosted API access to the 12B class, check Google AI Studio or Vertex Model Garden. For local agents, run Ollama or litert-lm serve and point OpenCode, Codex CLI, Crush, or Pi at the OpenAI-compatible base URL.

Which AI coding agents work with local Gemma 4 12B in 2026?

OpenCode, Pi, Codex CLI, and Crush support custom base URLs or native Ollama providers. Claude Code, Cursor, Google Antigravity, and GitHub Copilot cloud agent expect cloud models for full agent features. Consumer Gemini CLI is moving to Antigravity CLI; enterprise Gemini CLI continues. Pair Gemma locally with OpenCode or Crush for terminal agents. Keep Claude Code on Anthropic and Codex on OpenAI for hard production repos.

What is Gemma 4 QAT and where do I get the 12B checkpoints?

QAT (Quantization-Aware Training) checkpoints released June 5, 2026 simulate quantization during training, so compressed weights hold more quality than standard post-training quants. For 12B, grab the Q4_0 GGUF at google/gemma-4-12B-it-qat-q4_0-gguf (~6.7 GB), compressed tensors for vLLM at google/gemma-4-12B-it-qat-w4a16-ct, or Unsloth QAT GGUFs at unsloth/gemma-4-12B-it-qat-GGUF. MTP QAT checkpoints are also available.
