The open-source AI landscape just made a quiet but seismic shift. According to the latest Kilo model feed update from June 2026, six of the nine top-ranked open-weight coding models now ship with 1-million-token context windows. For teams building AI-powered development workflows, this is the inflection point where self-hosted agents stop being toys and start being infrastructure.
The 1M-Context Club
The numbers are striking. Of the nine featured open-weight coding models ranked by Kilo, the following support 1-million-token context windows:
- GLM 5.2 — Z.ai, 1M context, MIT license, Kilo Bench 53.0%
- MiniMax M3 — MiniMax, 1M context, open weights, Kilo Bench 47.6%
- DeepSeek V4 Pro — DeepSeek, 1M context, MIT license, Kilo Bench 44.0%
- DeepSeek V4 Flash — DeepSeek, 1M context, MIT license, efficiency-optimized
- Qwen3.7 Max — Alibaba, 1M context, Apache 2.0, Kilo Bench 54.6%
- Nemotron 3 Ultra — NVIDIA, 1M context, NVIDIA Nemotron Open license
The remaining three — Kimi K2.7 Code, Qwen3 Coder Next, and Devstral 2 — still offer a substantial 262K-token context window, which was considered frontier territory barely a year ago. The floor has risen dramatically.
Why 1M Tokens Changes the Agent Equation
A 1-million-token context window isn't just a bigger number on a spec sheet. It qualitatively changes what an AI coding agent can do in a single session.
Context window — The maximum amount of text (measured in tokens) a language model can process in a single forward pass. Longer windows let models reason over more code, documents, or conversation history without truncation or retrieval workarounds.
Consider the practical implications. A mid-sized Python codebase with tests, configuration files, and documentation might run 500K to 800K tokens. A set of business documents — contracts, financial reports, internal wikis — can easily exceed 200K tokens. Previously, an AI agent working on these materials had to chunk, summarize, or use retrieval-augmented generation (RAG) to cope with limited context. Each of these workarounds introduces information loss, latency, and complexity.
With 1M-token models, an agent can ingest an entire project holistically. It sees the tests alongside the code they test. It reads the architecture doc and the implementation at the same time. For business workflows, it can hold a full quarterly report, the prior quarter's report, and the relevant strategy memo in a single pass — no retrieval pipeline required.
This matters most for long-horizon agent workflows, a use case specifically called out in the source material for both GLM 5.2 and MiniMax M3. When an AI agent needs to plan multi-step tasks — refactoring a module, writing integration tests, migrating an API — it needs to maintain coherent understanding across hundreds of file edits. A 1M context window gives it that continuity.
The Mixture-of-Experts Efficiency Revolution
The other story hiding in the specs is how these models manage to offer 1M context while remaining deployable. The answer is Mixture-of-Experts (MoE) architecture, and nearly every model on the list uses it.
The headline numbers look intimidating: DeepSeek V4 Pro has 1.6 trillion total parameters. But only 49 billion are activated per token. Nemotron 3 Ultra has 550 billion total parameters with just 55 billion active. Qwen3 Coder Next takes this to an extreme — 80 billion total parameters with only 3 billion activated per token thanks to a sparse MoE design.
This architecture has a direct consequence for teams running their own infrastructure. The memory and compute requirements are governed by the *active* parameter count, not the total. DeepSeek V4 Flash, explicitly described as "efficiency-optimized," activates just 13 billion of its 284 billion total parameters. That's in the territory of models that can run on a single high-end GPU — making self-hosted deployment genuinely feasible rather than theoretically possible.
The trend is clear: open-weight model designers are optimizing for real-world deployment constraints, not just benchmark leaderboards.
Benchmark Reality Check
The Kilo benchmark table compares these models across three measures: SWE-Bench Verified, Terminal-Bench 2.0/2.1, and LiveCodeBench. These are software-engineering-focused evaluations, not general knowledge quizzes — they measure whether a model can actually fix bugs, write working code, and navigate real repositories.
The standout performers:
- DeepSeek V4 Pro — 80.6% SWE-Bench Verified, 67.9% Terminal-Bench, 93.5% LiveCodeBench
- Kimi K2.6 — 80.2% SWE-Bench Verified, 66.7% Terminal-Bench, 89.6% LiveCodeBench
- Qwen3.6-27B — 77.2% SWE-Bench Verified (a 27B dense model punching well above its weight)
- GLM-5.1 — SOTA on SWE-Bench Pro and Terminal-Bench for open-source models
For context, SWE-Bench Verified scores above 80% represent models that can resolve the majority of real GitHub issues autonomously. That these scores are achieved by open-weight models you can run on your own hardware — or through your own API keys without vendor markup — is the real headline.
GLM-5.1 is recommended by the source as the "best overall agentic coding" model, though its 744B-A40B parameter profile demands serious infrastructure.
What This Means for Self-Hosted AI Teams
The convergence of massive context windows and MoE efficiency creates a specific opportunity: self-hosted AI agents that can reason over entire business artifacts without external API dependencies.
This is exactly the scenario a self-hosted AI team is designed for. When your coder agent runs on your own VPS with a 1M-context model and your own API key, it can process your full codebase or lengthy business documents in a single pass — no data leaves your infrastructure, and you pay the model provider directly at their standard rate. With a one-time $199 purchase and your own key from OpenRouter or another provider, the economics are fundamentally different from per-seat SaaS that charges markup on every token.
Get OfficeForge — $199Here's what shifts concretely:
Codebase-level understanding. A developer agent with 1M tokens of context can hold an entire repository — source, tests, configs, CI definitions, documentation — without splitting it into chunks. This eliminates the failure mode where an agent "forgets" that a function is used elsewhere when refactoring. Kimi K2.7 Code, built specifically for "end-to-end programming tasks reliably over long contexts," exemplifies this design philosophy.
Document-intensive business workflows. Legal review, financial analysis, compliance auditing — these tasks involve long, interconnected documents where context loss is costly. A research or secretary agent running Qwen3.7 Max (optimized for "office and productivity tasks") with 1M context can hold multiple related documents simultaneously.
Reduced infrastructure complexity. Without the need for elaborate RAG pipelines to work around context limits, the self-hosted stack gets simpler. Fewer moving parts means fewer things to maintain, debug, and secure — a meaningful advantage for small teams running their own VPS.
Cost control through architecture. MoE models like Qwen3 Coder Next (3B active) and DeepSeek V4 Flash (13B active) let teams route routine tasks to efficient models while reserving larger models for complex reasoning. This is the "pick the right brain for the job" strategy that only works when you control the deployment.
The Licensing Landscape
The licensing story is equally encouraging for teams building on open foundations. Four of the nine featured models use the MIT license (GLM 5.2, DeepSeek V4 Pro, DeepSeek V4 Flash). Three use Apache 2.0 (Qwen3 Coder Next, Qwen3.7 Max, Devstral 2). MiniMax M3 uses open weights. Only Nemotron 3 Ultra uses a custom NVIDIA license.
Both MIT and Apache 2.0 are permissive licenses that allow commercial use, modification, and redistribution without significant restrictions. For businesses that need to audit their AI supply chain — particularly in regulated industries — this licensing clarity matters as much as the model's capabilities.
Who Should Pay Attention
This news matters most for three groups:
Small engineering teams that want AI coding assistance but can't justify or don't want per-seat SaaS costs. Running a 13B-active-parameter model like DeepSeek V4 Flash on a modest GPU and using it via your own API key fundamentally changes the cost equation.
Businesses in regulated industries (finance, legal, healthcare) where data sovereignty isn't optional. Self-hosted deployment with no data leaving the infrastructure — enabled by the fact that these models are openly downloadable — turns AI coding assistance from a compliance risk into a compliance advantage.
AI tool builders and integrators who need models they can customize, fine-tune, and deploy without negotiating enterprise licensing agreements. The combination of permissive licenses and MoE efficiency makes it economically viable to build specialized agents on top of these foundations.
The Bigger Picture
The June 2026 snapshot from Kilo reveals something larger than individual model capabilities. The open-weight ecosystem has reached a maturity point where the best open models compete with proprietary alternatives on real software-engineering benchmarks — while offering the deployment flexibility that proprietary models deliberately withhold.
For teams evaluating OfficeForge vs ChatGPT Teams or similar trade-offs, the question is no longer "can open models match proprietary ones?" The data shows they can. The question is whether your team wants to build the infrastructure to take advantage of that — or whether you want a turnkey self-hosted solution that handles the integration for you.
Either way, the era of 1M-context open-weight models is here, and it changes the calculus for every team thinking about where their AI agents should run.
FAQ
What open-source coding models offer 1M-token context in 2026?
As of June 2026, GLM 5.2, MiniMax M3, DeepSeek V4 Pro, DeepSeek V4 Flash, Qwen3.7 Max, and Nemotron 3 Ultra all support 1M-token context windows. Kimi K2.7 Code, Qwen3 Coder Next, and Devstral 2 offer 262K.
Which open-weight model scores highest on SWE-Bench Verified?
DeepSeek V4 Pro leads the benchmark table at 80.6% on SWE-Bench Verified, closely followed by Kimi K2.6 at 80.2%. Both use the MIT or Modified MIT license.
Can I run these 1M-context models on my own server?
Yes. Most models are available locally via Ollama, LM Studio, or vLLM. Mixture-of-Experts designs like DeepSeek V4 Flash (13B active parameters) and Qwen3 Coder Next (3B active) make local deployment more practical than their total parameter counts suggest.
What does BYO key mean for open-weight models?
Bring Your Own Key means you supply an API key from a provider like OpenRouter or OpenAI directly. The platform uses your key without marking up token costs — you pay the provider's standard rate.
Are open-weight models competitive with proprietary coding models?
According to the June 2026 Kilo benchmark data, top open-weight models like DeepSeek V4 Pro (80.6% SWE-Bench) and GLM-5.1 (SOTA on SWE-Bench Pro and Terminal-Bench) are competitive with proprietary offerings on real software-engineering tasks.
