Every data team knows the drill: product managers want answers, analysts are underwater, and self-serve BI dashboards only get you so far. GitHub just published a detailed teardown of how they solved this problem internally — and the patterns they describe are surprisingly relevant to anyone building AI agent teams on their own infrastructure.
On June 19, 2026, GitHub engineers Matteo Vasirani and Cynthia Joseph published "How we built an internal data analytics agent." The post walks through the architecture of Qubot, a Copilot-powered agent that lets any GitHub employee ask plain-language questions about the company's data warehouse and get answers within seconds. It's not a dashboard or a reporting tool — it's an agent purpose-built for exploratory questions like "Which cohort of users has the highest retention on this feature?" or "What product contributed to move this metric the most last week?"
The post is worth reading in full, but what caught our attention isn't just the product. It's the architectural thinking behind it — and how much of it maps onto the challenges teams face when deploying AI agents in production, especially in self-hosted, multi-role setups.
The Three-Layer Architecture
Qubot's design splits cleanly into three components: a user interface layer, a context layer, and a query engine layer. This separation isn't just tidy engineering — it's a lesson in how to make AI agents actually reliable.
The interface layer spans Slack, VS Code, and the Copilot CLI. Slack is the preferred entry point: a user posts a question in a dedicated channel, a Qubot instance spawns as a Copilot Cloud Agent, and the answer arrives directly in Slack. The user can iterate in-thread, and every result is also stored as a markdown report in a pull request for future reference. This design choice — making results both conversational and persistent — is a pattern worth stealing.
The query engine layer connects to two systems: Kusto for fast exploratory queries over recent event data, and Trino for complex joins and deeper historical analysis. Critically, users don't need to know which engine to use. Qubot defaults to Kusto and automatically switches to Trino when the question demands it. To connect the agent to both engines, GitHub built a custom Trino server for the Model Context Protocol (MCP) and deployed a local version of the Fabric RTI MCP Server for Kusto. The abstraction hides infrastructure complexity from the human asking the question, which is exactly what agents should do.
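The post does not say how Qubot decides when to leave Kusto, so the following is only a rough sketch of the "default to the fast engine, fall back when needed" idea. The `QueryPlan` fields, the 30-day window, and the join threshold are all assumptions made for illustration, not details from GitHub.

```python
from dataclasses import dataclass

# Assumed cutoff for "recent" data; the real boundary is not given in the source.
RECENT_WINDOW_DAYS = 30


@dataclass(frozen=True)
class QueryPlan:
    """Hypothetical summary of what a question needs from the warehouse."""
    lookback_days: int  # how far back in time the question reaches
    join_count: int     # number of table joins the generated SQL would need


def choose_engine(plan: QueryPlan, max_fast_joins: int = 1) -> str:
    """Default to Kusto; switch to Trino for deep history or heavy joins."""
    if plan.lookback_days > RECENT_WINDOW_DAYS:
        return "trino"
    if plan.join_count > max_fast_joins:
        return "trino"
    return "kusto"
```

In a real agent the routing decision is more likely made by the model itself, guided by context about each engine. A deterministic rule like this one is mainly useful as a guardrail or as a baseline to evaluate against.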
But the most instructive piece is the middle layer.
Federated Context: The Real Innovation
GitHub's data warehouse follows a standard medallion architecture — bronze (raw events), silver (conformed facts and dimensions), and gold (curated business datasets). Qubot's context layer mirrors this structure with federated knowledge contributions:
- Bronze data gets telemetry context contributed by product teams — schema information and metadata.
- Silver data gets query examples, usage guidance, and mandatory filters maintained by the data and analytics team.
- Gold data gets business rules and metric definitions contributed by the teams owning those datasets.
This is federated context done right. Instead of one central team trying to document everything (a losing battle at scale), knowledge lives where it's produced. A context agent then ingests, organizes, and normalizes contributions into a structured format that Qubot can consume at runtime via the GitHub MCP Server. Teams contribute through a standardized template or by referencing a repository containing relevant context. Since GitHub primarily uses markdown for documentation, there's no integration tax — no juggling multiple tools.
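To make the ingest-and-normalize step concrete, here is a minimal sketch of how contributions from different teams could be grouped into one record per dataset. The `Contribution` and `DatasetContext` shapes and the `normalize` function are hypothetical. GitHub's post does not publish its template or schema, and its context agent is LLM-driven rather than a plain function like this.

```python
from dataclasses import dataclass, field

LAYERS = ("bronze", "silver", "gold")


@dataclass
class Contribution:
    """One team's markdown contribution (hypothetical shape)."""
    layer: str    # bronze, silver, or gold
    dataset: str  # table or dataset the knowledge applies to
    owner: str    # contributing team
    body: str     # markdown text supplied by that team


@dataclass
class DatasetContext:
    """Normalized record an agent could load at runtime."""
    dataset: str
    layer: str
    owners: list[str] = field(default_factory=list)
    sections: list[str] = field(default_factory=list)


def normalize(contributions: list[Contribution]) -> dict[str, DatasetContext]:
    """Group contributions by dataset and reject inconsistent layer labels."""
    index: dict[str, DatasetContext] = {}
    for c in contributions:
        layer = c.layer.strip().lower()
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {c.layer!r}")
        key = c.dataset.strip().lower()
        ctx = index.setdefault(key, DatasetContext(dataset=key, layer=layer))
        if ctx.layer != layer:
            raise ValueError(f"{key} is registered in two layers")
        if c.owner not in ctx.owners:
            ctx.owners.append(c.owner)
        ctx.sections.append(c.body.strip())
    return index
```

Keying on the dataset name means several teams can add knowledge about the same table without overwriting each other, which is the property that makes the federated model workable.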
The lesson for teams building AI agents: context architecture matters more than model selection. The smartest LLM in the world produces garbage if it's operating against an empty or poorly structured knowledge layer. GitHub's federated model — where domain owners contribute structured knowledge that gets loaded at runtime — is a pattern any team can replicate.
Evaluation Before Deployment
The other standout detail is GitHub's evaluation framework. Every change to the context layer or agent configuration is tested before it ships. When someone enriches the context with new knowledge via a pull request, an offline evaluation framework measures accuracy, latency, and regressions.
The benchmarking system has three components:
- Test cases: A curated dataset of prompts with known correct answers, ground-truth SQL, and metadata like domain and difficulty.
- Automated run orchestration: A script that launches each test case as an agent task using the GitHub CLI (`gh agent-task create`), runs multiple parallel trials, polls for completion, and saves detailed JSON results.
- Stats aggregation: A reporting script that computes per-test-case metrics: completion rate, accuracy, and duration (average, min, max).
The end-to-end flow is straightforward: define test cases → run Qubot N times per case → collect results → aggregate stats → compare configurations.
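The core of that loop fits in a few lines. The sketch below is not GitHub's harness: it takes a caller-supplied `run_trial` function instead of invoking the GitHub CLI, runs trials sequentially rather than in parallel, and scores accuracy by exact string match over all trials. Each of those choices, along with the `Trial` shape, is an assumption made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional


@dataclass
class Trial:
    """Result of one agent run (hypothetical shape)."""
    completed: bool
    answer: Optional[str]
    seconds: float


def evaluate(case_id: str, expected: str,
             run_trial: Callable[[str], Trial], n_trials: int = 5) -> dict:
    """Run one test case n_trials times and aggregate per-case metrics."""
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    trials = [run_trial(case_id) for _ in range(n_trials)]
    done = [t for t in trials if t.completed]
    correct = [t for t in done if t.answer == expected]
    durations = [t.seconds for t in done]
    return {
        "case_id": case_id,
        "completion_rate": len(done) / n_trials,
        # Incomplete trials count as wrong, so accuracy is over all trials.
        "accuracy": len(correct) / n_trials,
        "duration_avg": mean(durations) if durations else None,
        "duration_min": min(durations) if durations else None,
        "duration_max": max(durations) if durations else None,
    }
```

Comparing two configurations then amounts to running `evaluate` over the same test cases with each configuration and diffing the resulting dictionaries.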
This is the kind of rigor that most AI agent deployments skip. Teams ship agents, cross their fingers, and hope users don't notice when accuracy degrades after a context update. GitHub's approach — treating every context change like a code change that needs to pass tests — is the right mental model. And the fact that they run multiple trials per test case acknowledges something important about LLM behavior: stochasticity is real, and single-run evaluations lie.
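To put a number on that last point, here is a small sketch under a simplifying assumption that is ours, not GitHub's: each trial is an independent pass/fail draw with a fixed true accuracy `p`. The function name is hypothetical. It computes the standard error of the measured accuracy, which shrinks with the square root of the number of trials.

```python
import math


def accuracy_standard_error(p: float, n_trials: int) -> float:
    """Standard error of an accuracy estimate from n independent trials."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    # Binomial proportion: sqrt(p * (1 - p) / n)
    return math.sqrt(p * (1.0 - p) / n_trials)
```

Under that assumption, a test case the agent answers correctly 70% of the time has a standard error of about 0.46 with one trial and about 0.14 with ten. A single run therefore cannot reliably tell a 70% configuration from a 90% one.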
What This Means for Self-Hosted AI Teams
GitHub is a large company with dedicated data platform engineers. But the patterns they describe are not locked to their scale. Here's what's transferable:
Layered context beats monolithic prompts. Whether you're running a data analytics agent or a multi-role AI team, structured context that's loaded at runtime — not crammed into a system prompt — produces more reliable results. Separating metadata, usage guidance, and business rules into distinct, maintainable layers is something any team can do, even with a modest knowledge base.
Federated contribution scales; central documentation doesn't. If your AI agents need to know about five different domains, have the domain experts contribute structured knowledge rather than expecting one person to document everything. Standardized templates and ingestion agents lower the friction.
MCP is the integration pattern. GitHub's entire query engine layer runs through MCP servers: a custom one for Trino and a locally deployed Fabric RTI MCP Server for Kusto. MCP is becoming a widely adopted standard for connecting agents to external tools and data sources. Teams building self-hosted agent architectures should be planning around it.
Evaluation is not optional. If you're running agents that produce outputs people rely on, you need a testing framework. It doesn't have to be as sophisticated as GitHub's, but the core loop — curated test cases, automated runs, aggregated metrics — is achievable for any team.
These patterns (federated knowledge layers and MCP tool integration), combined with multi-role agents that keep persistent memory, are what OfficeForge is built around. It provides five AI employees (researcher, coder, copywriter, secretary, designer) running on your own VPS, each with structured skills and access to external tools via MCP. OfficeForge agents also keep a shared knowledge graph so they don't re-research the same facts: context compounds instead of resetting.
The Bigger Picture
Qubot has been widely adopted internally at GitHub, with hundreds of users running thousands of queries. The number of questions directed to data and analytics Slack channels has dropped dramatically — not because people stopped being curious, but because they can now explore data on their own.
This is the promise that "self-serve analytics" has made for decades and never delivered. What changed isn't just that LLMs can generate SQL. It's that the full stack — interface, context management, query routing, evaluation — has matured enough to make the output trustworthy at scale.
For teams evaluating how to deploy AI agents in their own organizations — whether for data analytics, content production, code generation, or research — GitHub's post is a masterclass in getting the architecture right. The model matters, but the context layer, the tool integration, and the evaluation loop are what separate a demo from a production system.
The tools to build this yourself are more accessible than ever. Whether you're using OfficeForge's self-hosted AI team or assembling your own stack, the principles are the same: structured context, clean tool integration, persistent memory, and rigorous testing. GitHub just showed their homework — and it's worth studying.
---
*Source: How we built an internal data analytics agent — The GitHub Blog, Matteo Vasirani & Cynthia Joseph, June 19, 2026.*
FAQ
What is Qubot?
Qubot is GitHub's internal, Copilot-powered analytics agent that lets any employee ask plain-language questions about GitHub's data warehouse and get answers in seconds — no SQL expertise required.
How does Qubot's context layer work?
Knowledge is contributed by different teams in a federated model: product teams add telemetry metadata, the data team adds query examples and usage guidance, and dataset owners contribute business rules. A context agent ingests and normalizes everything into a structured format.
Which query engines does Qubot use?
Qubot connects to both Kusto (for fast exploratory queries over recent event data) and Trino (for complex joins and historical analysis). It defaults to Kusto and auto-switches when a question requires Trino.
How does GitHub evaluate Qubot's accuracy?
Every change to the context layer or agent configuration runs through an evaluation framework with curated test cases, automated orchestration that runs multiple parallel trials per case, and stats aggregation measuring completion rate, accuracy, and latency.
Why does this matter for smaller teams?
The architectural patterns GitHub used — federated context, MCP-based tool integration, layered data curation, and automated evaluation — are directly transferable to self-hosted AI setups where teams build multi-role agents on their own infrastructure.
