AI Researcher Automates Own Job Analysis, Creates Collaborative Tool for Copilot Applied Science Team

Breaking News: GitHub Copilot Applied Science Team Unveils 'eval-agents' Tool

A researcher on GitHub's Copilot Applied Science team has developed a tool that automates the analysis of AI coding agents' performance, turning an arduous manual process into a streamlined, collaborative one. The tool, named 'eval-agents', cuts the material a researcher must read from hundreds of thousands of lines of trajectory data to a few hundred, enabling faster iteration and deeper insights.

Source: github.blog

“I may have effectively automated myself into a completely different role,” said the researcher, who requested anonymity due to team policies. “Instead of manually reading every trajectory file, I now spend my time refining the agents that do the heavy lifting—and sharing those gains with my peers.”

The Problem: Overwhelming Data from Benchmarks

Analyzing coding agent performance means examining large volumes of trajectory data from benchmarks like TerminalBench2 and SWEBench-Pro. Each task produces a JSON file recording the agent's thought process and actions, often hundreds of lines long. With dozens of tasks per benchmark run and multiple runs daily, the files add up to hundreds of thousands of lines for researchers to review.
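To make the scale concrete, the following is a minimal Python sketch that counts lines across the trajectory files of one or more benchmark runs. The directory layout (one folder per run, one JSON file per task) and the function names count_trajectory_lines and total_lines are assumptions made for illustration; the source does not describe how the team stores its files.

```python
from pathlib import Path


def count_trajectory_lines(run_dir: Path) -> dict[str, int]:
    """Return the line count of every *.json file directly inside run_dir.

    Assumes (hypothetically) one JSON trajectory file per benchmark task.
    """
    counts: dict[str, int] = {}
    for path in sorted(run_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            counts[path.name] = sum(1 for _ in handle)
    return counts


def total_lines(run_dirs: list[Path]) -> int:
    """Sum trajectory line counts across several benchmark runs."""
    return sum(sum(count_trajectory_lines(d).values()) for d in run_dirs)
```

With purely illustrative figures (not taken from the source) of 50 tasks per run, 500 lines per file, and 8 runs, the total would be 50 x 500 x 8 = 200,000 lines, which is the order of magnitude the article describes.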

“It’s an impossible task to do alone,” the researcher explained. “I relied on GitHub Copilot to surface patterns, but that still meant repeatedly following the same loop: generate insights, investigate manually, then repeat. The engineer in me said, ‘I want to automate that.’”

The Solution: Agent-Driven Automation

That automation is now embodied in eval-agents, a set of shareable, authorable agents that perform the intellectual toil of analysis. The tool uses GitHub Copilot not only to identify patterns but also to create and modify the agents themselves, closing the feedback loop.

“I approached this with three guiding goals: make these agents easy to share and use, make it easy to author new agents, and make coding agents the primary vehicle for contributions,” said the researcher. “Those principles—sharing and collaboration—are in GitHub’s DNA.”

Background

GitHub Copilot is an AI-powered code completion tool used by millions of developers. Its Applied Science team focuses on advancing the capabilities of coding agents—AI models that can plan, code, and debug autonomously. Evaluating these agents requires rigorous benchmark tests that generate large, complex datasets.

Previously, researchers had to manually inspect each trajectory to understand where agents succeeded or failed. The new eval-agents tool automates this by creating agents that analyze trajectories, highlight common failure modes, and suggest improvements—all without human intervention.
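As a rough illustration of what "highlighting common failure modes" could look like, here is a small Python sketch that tallies failure labels across unresolved trajectories. It is not the eval-agents implementation: the trajectory schema (the resolved, steps, and observation keys), the FAILURE_RULES table, and both function names are hypothetical, and simple keyword matching stands in for the model-driven analysis the article describes.

```python
from collections import Counter

# Hypothetical keyword rules. A real analysis agent would rely on a language
# model rather than substring matching; this table only illustrates the idea.
FAILURE_RULES = {
    "timeout": ("timed out", "timeout"),
    "test_failure": ("assertionerror", "tests failed"),
    "missing_file": ("no such file",),
}


def classify_failure(trajectory: dict) -> str:
    """Return the first failure label whose keyword appears in any observation."""
    steps = trajectory.get("steps", [])
    text = " ".join(step.get("observation", "") for step in steps).lower()
    for label, keywords in FAILURE_RULES.items():
        if any(keyword in text for keyword in keywords):
            return label
    return "unclassified"


def summarize_failures(trajectories: list[dict]) -> Counter:
    """Count failure labels across trajectories that did not resolve their task."""
    failed = (t for t in trajectories if not t.get("resolved", False))
    return Counter(classify_failure(t) for t in failed)
```

Calling summarize_failures on a list of loaded trajectories returns a Counter, so most_common() gives the failure modes ranked by frequency, which is the kind of few-hundred-line summary a researcher would read instead of the raw files.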

What This Means

With eval-agents, the Copilot Applied Science team can collaborate more effectively. Any team member can create a new agent to investigate a specific hypothesis, share it immediately, and benefit from improvements made by others.

“This unlocks an incredibly fast development loop,” the researcher noted. “My peers can now build solutions that fit their needs without waiting for a specialist to analyze each benchmark run. We’re moving from individual investigation to team-wide automation.”

The broader implications for AI research are significant. By automating the intellectual toil of data analysis, researchers can focus on higher-level creative problem-solving, potentially accelerating the pace of advancement in coding agents and other AI systems.

How It Works

Share and Use Agents: Agents are packaged as reusable components, available to the entire team via a shared repository.
Author New Agents Easily: A simplified scripting interface allows researchers with basic coding skills to create analysis agents tailored to specific benchmarks.
Primary Contribution Vehicle: All improvements to the analysis pipeline are made through coding agents, ensuring consistency and traceability.
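
The three points above can be sketched as a small registry in Python, where each analysis agent is a named function that any teammate can look up and run. Everything here is an assumption for illustration: AGENT_REGISTRY, register_agent, run_agent, the example pass-rate agent, and the resolved key are hypothetical names, not the actual eval-agents interface.

```python
from typing import Callable

# An agent is modeled (hypothetically) as a function from trajectories to a report.
Agent = Callable[[list[dict]], str]

AGENT_REGISTRY: dict[str, Agent] = {}


def register_agent(name: str) -> Callable[[Agent], Agent]:
    """Decorator that publishes an analysis function under a shared name."""
    def decorator(func: Agent) -> Agent:
        if name in AGENT_REGISTRY:
            raise ValueError(f"agent {name!r} is already registered")
        AGENT_REGISTRY[name] = func
        return func
    return decorator


@register_agent("pass-rate")
def pass_rate(trajectories: list[dict]) -> str:
    """Example agent: report how many trajectories resolved their task."""
    resolved = sum(1 for t in trajectories if t.get("resolved"))
    return f"{resolved}/{len(trajectories)} tasks resolved"


def run_agent(name: str, trajectories: list[dict]) -> str:
    """Look up a registered agent by name and run it on the given trajectories."""
    return AGENT_REGISTRY[name](trajectories)
```

In this sketch, authoring a new agent means writing one decorated function, and sharing it means committing that function to the repository the registry is loaded from. Rejecting duplicate names keeps two contributors from silently overwriting each other's agents.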

The researcher emphasized that this is just the beginning: “We’ve already seen a dramatic reduction in time spent on routine analysis. Imagine what else we can automate when the tools themselves become self-improving.”

Conclusion

The eval-agents tool represents a paradigm shift in how AI researchers handle performance evaluation. By embracing agent-driven development, the Copilot Applied Science team has turned a tedious manual process into an automated, collaborative system that scales with the data.

“This is the familiar pattern of a programmer automating their own job,” the researcher concluded. “Only now, we’ve automated away the intellectual toil, not just the physical repetition. And we’re sharing that automation with everyone.”
