Ollama 0.31.1 Cuts Gemma 4 Latency on Apple Silicon

A new release from the Ollama project promises to make running capable AI models locally more practical. The v0.31.1 release focuses squarely on performance, delivering a substantial speed increase for the Gemma 4 model family on Apple Silicon devices. For small teams and developers building with local AI, the resulting drop in latency directly affects the viability of self-hosted coding agents and AI assistants.

What Exactly Changed: A Technical Breakdown

The headline change is that Gemma 4 now runs significantly faster in Ollama on Apple Silicon. The release notes report that token generation is nearly 90% faster on average, measured across a coding-agent benchmark.
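To put that number in latency terms, note that a throughput gain shortens generation time by less than the percentage suggests. The sketch below assumes "nearly 90% faster" means tokens per second rise by a factor of about 1.9; the baseline rate and reply length are made-up illustrative values, not figures from the release.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` at a steady decode rate."""
    return tokens / tokens_per_second


BASELINE_TPS = 20.0  # hypothetical baseline rate, not from the release notes
SPEEDUP = 1.9        # assumption: "nearly 90% faster" read as ~1.9x throughput
REPLY_TOKENS = 500   # hypothetical reply length

before = generation_seconds(REPLY_TOKENS, BASELINE_TPS)
after = generation_seconds(REPLY_TOKENS, BASELINE_TPS * SPEEDUP)
time_saved = 1 - after / before  # fraction of wall-clock time removed

print(f"before: {before:.1f}s, after: {after:.1f}s, saved: {time_saved:.0%}")
# prints "before: 25.0s, after: 13.2s, saved: 47%"
```

Under these assumptions, a reply arrives in roughly half the time, not a tenth of it. The saved fraction depends only on the speedup factor, so it holds for any baseline rate.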

The mechanism behind this boost is Multi-Token Prediction (MTP). Traditionally, large language models generate text one token at a time in an autoregressive loop. With MTP, the model drafts several candidate tokens in a single pass; the system then verifies the drafts and keeps the ones that check out, so it can commit more than one token per step.
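The general draft-and-verify idea can be sketched with toy stand-ins. This is not Ollama's or Gemma's actual implementation: `target_next` and `draft_next` are hypothetical functions invented here to play the roles of the full model and a cheap drafter, and the "pass" counter only models the fact that a real engine can check a whole draft in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Toy 'full model': a deterministic next token for a given context."""
    return (sum(ctx) * 31 + len(ctx)) % 7


def draft_next(ctx: List[Token]) -> Token:
    """Toy drafter: agrees with the target unless len(ctx) is a multiple of 4."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 7


def greedy_decode(prompt: List[Token], n_new: int, target: NextFn) -> Tuple[List[Token], int]:
    """Plain autoregressive loop: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out, n_new


def draft_and_verify(prompt: List[Token], n_new: int, k: int,
                     target: NextFn, draft: NextFn) -> Tuple[List[Token], int]:
    """Draft k tokens, then keep the target's tokens up to the first mismatch."""
    out, passes, goal = list(prompt), 0, len(prompt) + n_new
    while len(out) < goal:
        drafted: List[Token] = []
        for _ in range(k):
            drafted.append(draft(out + drafted))
        # A real engine would score every drafted position in one batched
        # target pass; this toy calls target() repeatedly but counts one pass.
        passes += 1
        for tok in drafted:
            expected = target(out)
            out.append(expected)  # always commit the target's own token
            if expected != tok or len(out) == goal:
                break  # discard the rest of the draft
        else:
            out.append(target(out))  # whole draft accepted: one bonus token
    return out, passes


prompt = [1, 2, 3]
plain, plain_passes = greedy_decode(prompt, 12, target_next)
fast, fast_passes = draft_and_verify(prompt, 12, 3, target_next, draft_next)
print(plain == fast, plain_passes, fast_passes)
```

Because only tokens the target itself would have produced are committed, the output matches plain greedy decoding exactly; the saving comes from needing fewer target passes when the drafts are mostly right.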

Crucially, the Ollama implementation is designed for zero-configuration adoption. The release notes state that Ollama "auto-tunes how many tokens to draft as it runs," meaning the speedup is on by default. Users don't need to adjust parameters or switch models. Furthermore, this optimization does not change the model's output, preserving accuracy while accelerating throughput.
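The release notes do not describe how the draft length is tuned. Purely as an illustration of what such a controller could look like, here is a hypothetical heuristic: draft more when every drafted token was accepted, draft less when fewer than half were. The function name, thresholds, and bounds are invented for this sketch and are not Ollama's.

```python
def next_draft_length(current: int, accepted: int,
                      k_min: int = 1, k_max: int = 8) -> int:
    """Hypothetical rule: grow on full acceptance, shrink on poor acceptance."""
    if accepted == current:
        return min(current + 1, k_max)
    if accepted * 2 < current:
        return max(current - 1, k_min)
    return current


k = 4
history = []
for accepted in [4, 5, 6, 2, 1, 3]:  # made-up acceptance counts per step
    accepted = min(accepted, k)      # cannot accept more than were drafted
    k = next_draft_length(k, accepted)
    history.append(k)

print(history)  # prints [5, 6, 7, 6, 5, 5]
```

The point of any rule like this is that drafting has a cost: long drafts pay off on predictable text such as boilerplate code and waste work on text the drafter guesses badly.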

The version also bundles several engine updates that support this and other improvements:

Tightened Gemma 4 MoE model loading in the MLX engine: More efficient memory management for the Mixture-of-Experts architecture used in some Gemma models.
Updated MLX engine: Incorporates a new small-batch matrix multiplication kernel, optimizing compute for common operations.
Updated llama.cpp engine: Builds on the latest version (9840) of this widely used inference backend.

Why This Matters for Self-Hosted AI Teams

For teams exploring or already using self-hosted AI, performance and cost are the twin pillars of any practical deployment. This update attacks the performance pillar head-on, with direct implications for the other.

1. Making Local Coding Agents Feasible. A nearly 90% speedup in token generation changes the user experience of a local coding assistant. Tasks like code completion, explanation, and refactoring become more responsive, which matters most in real-time pair programming. Lower latency makes iterative prompting, a core part of working with AI agents, feel natural rather than sluggish. This pushes local models closer to parity with cloud API speeds for interactive tasks.

2. Cost-Effectiveness on Fixed Hardware. Apple Silicon (M-series chips) is a popular platform for local AI due to its unified memory architecture. Faster inference means more work gets done per minute on the same hardware. For a team running agents on a Mac Studio or a MacBook Pro, this translates directly to higher throughput: more code analyzed, more documents summarized, or more research queries processed before hitting any perceived limits. It maximizes the return on existing hardware investment.

3. The VPS Implication. The nearly 90% figure applies to Gemma 4 on Apple Silicon, but the bundled llama.cpp engine update is cross-platform. Teams running Ollama on Linux VPS instances with compatible hardware may also benefit from its general optimizations, though the release makes no specific speed claim for those setups. To the extent they do, a dedicated VPS for AI workloads becomes a stronger proposition, as the same dollar buys more computational work.

This efficiency gain shows why a hybrid approach to model selection is powerful. A self-hosted AI team can assign the now-faster Gemma 4 on Apple Silicon, or on a compatible VPS, to handle coding and reasoning tasks locally at near-zero marginal cost, while reserving a paid API key for only the most complex, strategic work. That split keeps costs down without giving up capability where it counts.

The Broader Trend: Local Inference Maturing

This release is a marker in a larger trend. The tooling for running AI models locally is actively getting faster and more efficient. Optimizations like multi-token prediction, improved quantization techniques, and engine-level kernel updates are narrowing the latency gap with cloud providers, often at a fraction of the long-term cost.

For a small business or a development team, this maturation is critical. It de-risks the decision to build workflows around local AI. When the tools are both powerful *and* getting faster on commodity hardware, the path to owning your AI infrastructure and data becomes clearer. The focus can shift from "can we run this?" to "what can we build with it?"

Practical Takeaways for Your Stack

If you're evaluating or using Ollama, the immediate action is simple: update to v0.31.1. The Gemma 4 speed boost is a free, automatic upgrade for your Apple Silicon machines. It's worth benchmarking your existing coding agent prompts before and after to measure the real-world impact.
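A before-and-after measurement could look like the sketch below. It assumes an Ollama server on its default local port and relies on the `eval_count` and `eval_duration` fields (the latter in nanoseconds) that Ollama's generate endpoint returns; the helper names are invented here, and the model tag is left to you because it depends on which Gemma 4 build you have pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def tokens_per_second(response: dict) -> float:
    """Decode rate from response metrics; eval_duration is in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_prompt(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generate request and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def benchmark(model: str, prompts: list[str], runs: int = 3) -> float:
    """Average decode rate over `runs` repetitions of each prompt."""
    rates = [
        tokens_per_second(run_prompt(model, p))
        for p in prompts
        for _ in range(runs)
    ]
    return sum(rates) / len(rates)
```

Call `benchmark` with the tag that `ollama list` shows for your Gemma 4 model and a handful of your real agent prompts, once before updating and once after. Keep the prompts, hardware, and background load the same so the two averages are comparable.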

For teams considering a self-hosted AI strategy, this news reinforces key principles:

Hardware choice matters. Apple Silicon is a strong contender for local inference performance.
Update frequently. Performance breakthroughs are rolling out in open-source tools continuously.
Design for agent workflows. The biggest gains come from using models in interactive, iterative loops where latency reduction compounds.

The era of local, self-hosted AI being a slow, experimental sideshow is ending. With tools like Ollama delivering cloud-rivaling speeds on personal hardware, the foundation for private, cost-controlled, and powerful AI teams is being laid one optimized release at a time. For businesses, the question is no longer if they should explore this, but how quickly they can start building on it.

FAQ

What is the main improvement in Ollama v0.31.1?

The primary change is significantly faster performance for the Gemma 4 model on Apple Silicon hardware, leveraging a new multi-token prediction (MTP) technique.

How much faster is Gemma 4 in this update?

According to the release notes, token generation is nearly 90% faster on average across a coding-agent benchmark.

Do I need to change settings to get this speed boost?

No. The optimization is automatic. Ollama auto-tunes the number of tokens it drafts during generation, so the speedup is enabled by default without user configuration.

Does this speed improvement affect the model's output quality?

The release explicitly states that the change "does not change the model's output." The acceleration is achieved without altering the final generated content.

What engines were updated in this release?

The update includes tightened Gemma 4 MoE model loading in the MLX engine, an update to the MLX engine itself (with a new small-batch matmul kernel), and an update to the underlying llama.cpp engine.

This article was researched, written and illustrated by OfficeForge's own AI team — Andrey (research), Kirill (writing), Alla (design) — the same five AI employees the product ships with. Founder-directed, human-reviewed. The blog is our product, doing real work.
