How good are general-purpose AI agents?
We test them across diverse domains.
No hand-tuning. No shortcuts. Fairly ranked.
Head-to-head performance across diverse benchmarks.
Which agents deliver the most for the least?
Three findings that challenge conventional assumptions about AI agents.
General-purpose agents match domain-optimized baselines — without ever seeing the environment before. Specialization is overrated.
Swap the model, change the outcome. Swap the scaffold, barely notice. The underlying LLM drives performance more than the agent architecture.
Across every benchmark, general agents match or beat heavily customized systems. The era of one-trick agents is ending.
Common questions about Exgentic's evaluation methodology and results.
Many agent or model whitepapers rely on significant prompt optimization to maximize performance on specific benchmarks. In Exgentic, we intentionally avoid prompt optimization to provide a more neutral and comparable evaluation across agents.
In addition, we report results on 100 sampled tasks per benchmark. For some benchmarks this is a subset of the full dataset, but we believe this is not the primary factor driving differences in results.
No. We do not modify the benchmarks or the agent implementations. In some cases we needed to slightly adapt the interface to fit the unified protocol, for example by externalizing prompts that are embedded inside the benchmark (e.g., TAU-Bench) or by adding task instructions when required by the benchmark specification (e.g., SWE-Bench Verified). These adjustments do not change the task itself.
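As a minimal sketch of what such an adaptation can look like, externalizing a prompt simply means passing text that used to be hard-coded in the benchmark harness into a shared task format, so every agent sees the same instruction and the task itself is untouched. The class and field names below are illustrative assumptions, not the actual Exgentic code.

```python
# Illustrative sketch only: BenchmarkTask and build_task are hypothetical
# names, not part of the Exgentic codebase.
from dataclasses import dataclass


@dataclass
class BenchmarkTask:
    task_id: str
    instruction: str   # task instructions required by the benchmark spec
    system_prompt: str # prompt externalized from the benchmark harness


def build_task(raw: dict, externalized_prompt: str) -> BenchmarkTask:
    """Wrap a raw benchmark record in a unified task format.

    The prompt that previously lived inside the benchmark code is passed in
    explicitly, so every agent receives identical text and nothing about the
    underlying task changes.
    """
    return BenchmarkTask(
        task_id=raw["id"],                   # hypothetical field names
        instruction=raw["problem_statement"],
        system_prompt=externalized_prompt,
    )
```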
We selected agents that represent commonly used general-agent architectures and that can operate across multiple benchmarks with minimal task-specific customization. Our goal is to evaluate general-purpose agents, not agents tailored to a single benchmark. We will add more agents, and we welcome contributions from the community.
Since Exgentic focuses on evaluating generality, we chose benchmarks that cover a diverse range of domains and task types, including coding, tool use, reasoning, and interactive environments.
We started with a small set of widely used models to establish the initial leaderboard. We plan to expand the leaderboard soon with many open-weight models as well as additional closed models.
Yes. Exgentic is fully open. The evaluation pipeline, agent implementations, and configuration are available in the repository so results can be reproduced. Minor variations may occur due to model version changes or nondeterminism in LLM outputs.
Everything is open. Jump in.
Clone the repo. Evaluate your agent against the field. Results in hours, not weeks.
Get the Code
Plug your evaluation environment into the Unified Protocol. Instant cross-agent testing.
Integrate
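As a rough sketch of what that integration can look like, an environment is wrapped behind a small reset/step/score interface that any agent in the harness can drive. The names here are illustrative assumptions, not the actual Unified Protocol API.

```python
# Hypothetical sketch of a unified-protocol adapter; the interface and the
# wrapped environment's methods are assumptions for illustration, not the
# actual Exgentic API.
from typing import Protocol


class UnifiedEnv(Protocol):
    def reset(self, task_id: str) -> str: ...             # initial observation
    def step(self, action: str) -> tuple[str, bool]: ...  # (observation, done)
    def score(self) -> float: ...                          # final task score


class MyEnvAdapter:
    """Wraps an existing evaluation environment behind the unified interface."""

    def __init__(self, env):
        self.env = env

    def reset(self, task_id: str) -> str:
        return self.env.load_task(task_id)    # hypothetical method on your env

    def step(self, action: str) -> tuple[str, bool]:
        obs = self.env.execute(action)        # hypothetical method on your env
        return obs, self.env.is_finished()

    def score(self) -> float:
        return self.env.final_score()
```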
Research updates, benchmarks, and findings from the Exgentic team.
Why pass/fail scores hide what matters most about agent systems.
How good are general-purpose AI agents? We built an open evaluation framework to find out.