Google's Gemma 4: Apache 2.0 Changes Everything

The License Was Always the Problem

For the better part of two years, Google's Gemma models occupied a strange limbo in enterprise AI evaluation. The performance was real (Gemma 3 was genuinely competitive with open-weight peers), but the custom Google license made procurement painful. Legal teams flagged ambiguous "harmful use" clauses. Compliance reviews stalled. And because Google could revise the terms unilaterally, organizations were essentially building on shifting ground.

The result: teams that needed a truly open foundation for commercial deployment went with Mistral, Qwen, or other models governed by standard Apache 2.0 terms instead, leaving Gemma largely to hobbyists and researchers willing to absorb the legal friction.

Gemma 4 removes that friction entirely. The entire family now ships under Apache 2.0 — identical licensing to Qwen, Mistral, and the majority of the serious open-weight ecosystem. No proprietary addenda. No clauses that invite legal interpretation. No uncertainty about whether a fine-tuned derivative can ship in a commercial product. This is the change enterprise teams were waiting for, and Google has paired it with the strongest open model release the company has ever shipped.

Four Models, Two Jobs

The Gemma 4 family consists of two tiers designed for distinct deployment contexts. On the edge side, the E2B and E4B are built for smartphones, embedded hardware like Raspberry Pi, and devices such as the Jetson Orin Nano. They handle text, image, and audio natively within a 128K-token context window. Google worked directly with Qualcomm and MediaTek to optimize these variants, and the Pixel team's involvement suggests these aren't theoretical mobile benchmarks — they're tuned against real hardware constraints.

The workstation tier offers a 31B dense model and a 26B Mixture-of-Experts (MoE) variant, both with 256K-token context windows and multimodal text-plus-vision input. The 31B targets maximum quality and serves as a fine-tuning base; the MoE prioritizes inference efficiency by activating only 3.8 billion of its 25-plus billion total parameters per forward pass — delivering roughly 27B-class output at closer to 4B-class compute cost. For teams running coding assistants or document pipelines at scale, that arithmetic matters enormously.

The naming deserves a quick translation. "E2B" means 2 billion effective parameters: the model actually carries more total weights through a technique called Per-Layer Embeddings, but the compute profile during inference matches a 2B model. The "A" in "26B A4B" signals active parameters: of the model's roughly 26B total, only 3.8B (rounded to 4B in the name) fire for any given token. Both conventions reflect Google's effort to be precise about what actually drives your GPU bill.
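To make the active-versus-total distinction concrete, here is a rough back-of-the-envelope sketch. It assumes the common rule of thumb of about 2 FLOPs per active parameter per generated token, and it treats 26e9 as the MoE's nominal total; neither figure is a Google-published measurement, and the variable names are illustrative.

```python
# Rough per-token compute comparison. Assumes the common rule of thumb of
# about 2 forward-pass FLOPs per *active* parameter per token.

def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs needed to process one token."""
    return 2.0 * active_params

DENSE_31B_PARAMS = 31e9   # dense model: every weight is used for every token
MOE_TOTAL_PARAMS = 26e9   # nominal total taken from the "26B" name (assumption)
MOE_ACTIVE_PARAMS = 3.8e9 # active parameters per token, as stated in the article

active_fraction = MOE_ACTIVE_PARAMS / MOE_TOTAL_PARAMS
compute_ratio = (
    forward_flops_per_token(DENSE_31B_PARAMS)
    / forward_flops_per_token(MOE_ACTIVE_PARAMS)
)

print(f"MoE active fraction: {active_fraction:.1%}")
print(f"Dense 31B vs MoE compute per token: {compute_ratio:.1f}x")
```

One caveat the arithmetic hides: all of the MoE's weights still have to sit in memory, so the saving shows up in compute per token rather than in the memory needed to hold the model.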

The MoE Architecture Is the Real Story

The design choices inside the 26B A4B warrant a closer look from anyone evaluating inference economics. Rather than following the trend toward a small number of large experts, the approach used in several competing MoE models, Google built the 26B around 128 small experts, routing each token to eight of them alongside a single always-active shared expert. This granular routing tends to produce more stable specialization and, importantly, it lets the model land close to dense 27–31B performance while running at inference speeds competitive with a 4B model.
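The routing scheme described above can be sketched in a few lines of plain Python. This is a toy illustration, not Gemma's implementation: the experts are stand-in scalar functions, and renormalizing the softmax over only the selected experts is one common convention that the article does not confirm. Only the 128-expert, top-8, one-shared-expert shape comes from the text.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article

def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts and return (index, weight) pairs summing to 1."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)           # for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, router_logits, experts, shared_expert):
    """Shared expert always runs; routed experts add a weighted contribution."""
    out = shared_expert(x)
    for idx, weight in route_token(router_logits):
        out += weight * experts[idx](x)
    return out

# Toy setup: each "expert" just scales its input by a fixed random factor.
rng = random.Random(0)
experts = [(lambda x, s=rng.uniform(-1, 1): s * x) for _ in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.5 * x
logits = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]

selected = route_token(logits)
y = moe_layer(1.0, logits, experts, shared_expert)
print(f"experts evaluated per token: {len(selected) + 1} of {NUM_EXPERTS + 1}")
```

Only nine of the 129 expert functions run for a given token, which is where the gap between total and active parameters comes from.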

To put that concretely: fewer GPUs, lower latency, cheaper per-token cost in production. For agentic workflows processing thousands of turns, or document pipelines ingesting large corpora, the MoE variant may be the most economically sensible choice in the family.

Both workstation models also use a hybrid attention design — alternating local sliding window attention with full global attention, with the final layer always global. This is what makes the 256K context window practical rather than theoretical; it keeps memory footprint under control while preserving the model's ability to reason across very long inputs.
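A small sketch shows why mixing local and global layers keeps long-context memory manageable. The layer count, the ratio of local to global layers, and the window size below are all hypothetical placeholders, since the article gives none of them; only the "final layer is always global" rule and the 256K context come from the text (with 256K taken as 256 × 1024 tokens).

```python
def layer_pattern(num_layers: int, local_per_global: int) -> list[str]:
    """Label each layer 'local' or 'global'; the last layer is forced global."""
    pattern = [
        "global" if (i + 1) % (local_per_global + 1) == 0 else "local"
        for i in range(num_layers)
    ]
    pattern[-1] = "global"
    return pattern

def cached_positions(pattern, context_len: int, window: int) -> int:
    """Token positions held in the KV cache, summed over all layers.

    A global layer keeps every position; a sliding-window layer keeps
    at most `window` of them.
    """
    return sum(
        context_len if kind == "global" else min(window, context_len)
        for kind in pattern
    )

CONTEXT = 256 * 1024  # 256K tokens, assuming K = 1024

# Hypothetical configuration: 12 layers, 5 local layers per global, 1024-token window.
pattern = layer_pattern(num_layers=12, local_per_global=5)
hybrid = cached_positions(pattern, CONTEXT, window=1024)
full = cached_positions(["global"] * 12, CONTEXT, window=1024)

print(pattern)
print(f"hybrid cache is {hybrid / full:.1%} of an all-global cache")
```

Under these made-up numbers the cache shrinks to a fraction of the all-global case; the real saving depends on the actual layer mix and window size.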

Multimodality is similarly native rather than bolted on. Variable aspect-ratio image input with configurable token budgets (70 to 1,120 tokens per image) lets developers dial the detail-versus-compute tradeoff explicitly. The edge models add a compressed audio encoder — 305 million parameters, down from 681 million in the prior generation — handling speech recognition and speech-to-translated-text entirely on-device. For healthcare, field service, or voice-first applications where data must stay local, running ASR, translation, reasoning, and tool calls inside a single on-device model is a genuine architectural win.
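The detail-versus-compute tradeoff for images is simple arithmetic, sketched below. The 70 and 1,120 token bounds and the 256K window come from the article; treating any value in that range as a valid budget, reading 256K as 256 × 1024, and the helper names themselves are assumptions made for illustration.

```python
MIN_IMAGE_TOKENS = 70      # lowest per-image budget cited in the article
MAX_IMAGE_TOKENS = 1120    # highest per-image budget cited in the article
CONTEXT_WINDOW = 256 * 1024  # workstation-tier context, assuming K = 1024

def image_token_cost(num_images: int, budget: int) -> int:
    """Total context tokens consumed by num_images at a given per-image budget."""
    if not MIN_IMAGE_TOKENS <= budget <= MAX_IMAGE_TOKENS:
        raise ValueError(f"budget must be in [{MIN_IMAGE_TOKENS}, {MAX_IMAGE_TOKENS}]")
    return num_images * budget

def max_images(budget: int, reserved_text_tokens: int = 0,
               context: int = CONTEXT_WINDOW) -> int:
    """How many images fit once some tokens are set aside for text."""
    image_token_cost(1, budget)  # reuse the range check
    return max(0, (context - reserved_text_tokens) // budget)

low_detail = max_images(MIN_IMAGE_TOKENS)
high_detail = max_images(MAX_IMAGE_TOKENS)
print(f"images per context at {MIN_IMAGE_TOKENS} tokens each: {low_detail}")
print(f"images per context at {MAX_IMAGE_TOKENS} tokens each: {high_detail}")
```

Because the top budget is sixteen times the bottom one, a pipeline that can tolerate low-detail images fits roughly sixteen times as many into the same window.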

Function calling, meanwhile, was trained in from the start rather than coaxed through instruction-following. The implication for teams building tool-using agents: less prompt engineering overhead, more reliable structured output, and better multi-turn behavior out of the box.
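On the application side, a tool-using agent still needs a thin layer that validates the model's structured output and runs the matching function. The sketch below shows that layer in generic form. The JSON shape, the tool registry, and the get_weather stand-in are all hypothetical; this is not Gemma 4's actual function-calling wire format, which the article does not describe.

```python
import json

def get_weather(city: str) -> str:
    """Stand-in tool; a real one would call a weather service."""
    return f"Sunny in {city}"

# Hypothetical registry mapping tool names to a callable and its required arguments.
TOOLS = {
    "get_weather": {"fn": get_weather, "required": {"city"}},
}

def dispatch(model_output: str):
    """Parse a JSON tool call emitted by the model, validate it, and run it."""
    call = json.loads(model_output)
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    args = call.get("arguments", {})
    missing = spec["required"] - set(args)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    return spec["fn"](**args)

# Example of a well-formed call, written by hand to mimic model output.
result = dispatch('{"name": "get_weather", "arguments": {"city": "Lisbon"}}')
print(result)
```

The more reliably a model emits well-formed calls, the less often the two error branches fire, which is the practical payoff of training function calling in from the start.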

Benchmarks: Strong, But Context Is Required

The 31B dense model scores 89.2% on AIME 2026, 80% on LiveCodeBench v6, and a Codeforces Elo of 2,150, numbers that would have been state-of-the-art for closed proprietary APIs not long ago. On the Chatbot Arena open model leaderboard, it debuted in third place, behind GLM-5 and Kimi K2.5, both of which are dramatically larger models. The MoE variant trails modestly, posting 88.3% on AIME 2026 and 77.1% on LiveCodeBench, a small gap given the inference cost advantage.

Third-party validation from Artificial Analysis puts the 31B at second among open models under 40B parameters on GPQA Diamond (scientific reasoning), just behind Qwen3.5 27B, while also showing lower compute consumption than comparable Qwen models.

The edge models outperform Gemma 3 27B on most benchmarks — without reasoning mode — despite being a fraction of the size. That generational jump is the clearest signal that the architecture improvements are substantive, not just parameter scaling.

That said, benchmark rankings in the open-weight space shift weekly. Qwen 3.5, GLM-5, and Kimi K2.5 all compete aggressively at similar parameter ranges. What differentiates Gemma 4 isn't any single number but the overall package: competitive reasoning, native multimodality, function calling, 256K context, and a clean license — available across a coherent family that spans smartphone to server.

What This Means

The Apache 2.0 switch is the headline that will matter most over the next twelve months. Google is moving toward openness at precisely the moment some Chinese labs are quietly pulling back: Alibaba's most recent Qwen releases have introduced restrictions that weren't present in earlier versions. Google's move puts pressure on the remaining closed or semi-open model providers because it resets expectations: frontier-adjacent quality under truly permissive terms is now the baseline, not a differentiator.

For developers: You can build, ship, and sell products on top of Gemma 4 derivatives under standard Apache 2.0 terms, without a review of a bespoke license. That removes the licensing objection that kept Gemma out of serious commercial pipelines.
For founders: The MoE variant's inference economics are worth modeling carefully before committing to a serving architecture. 27B-class output at 4B-class throughput is not a marketing claim — it's a GPU budget line item that could meaningfully affect unit economics at scale.
For enterprise architects: Google's serverless Cloud Run deployment (spinning to zero when idle) alongside Vertex AI and GKE means the same model weights can move from laptop evaluation to production cloud without a porting project. That continuity reduces integration risk.
For the open-weight ecosystem broadly: Meta's Llama series no longer has a monopoly on being the "safe" open-weight choice for commercial deployment. Google has arrived, fully, and that competition benefits everyone building on open foundations.

Whether or not additional Gemma 4 sizes follow (Google has hinted they might), the release as it stands is the most complete open model package Google has shipped. The wait for a Google open model that competes on licensing terms, not just benchmarks, is over.

Written by

Daily Neural Team