How good are general-purpose AI agents?
We test them across diverse domains.
No hand-tuning. No shortcuts. Fairly ranked.
Head-to-head performance across diverse benchmarks.
Which agents deliver the most for the least?
Three findings that challenge conventional assumptions about AI agents.
General-purpose agents match domain-optimized baselines — without ever seeing the environment before. Specialization is overrated.
Swap the model, change the outcome. Swap the scaffold, barely notice. The underlying LLM drives performance more than the agent architecture.
Across every benchmark, general agents match or beat heavily customized systems. The era of one-trick agents is ending.
Common questions about Exgentic's evaluation methodology and results.
Many agent or model whitepapers rely on significant prompt optimization to maximize performance on specific benchmarks. In Exgentic, we intentionally avoid prompt optimization to provide a more neutral and comparable evaluation across agents.
In addition, we report results on 100 sampled tasks per benchmark. For some benchmarks this is a subset of the full dataset, but we believe this is not the primary factor driving differences in results.
No. We do not modify the benchmarks or the agent implementations. In some cases we needed to slightly adapt the interface to fit the unified protocol, for example by externalizing prompts that are embedded inside the benchmark (e.g., TAU-Bench) or by adding task instructions when required by the benchmark specification (e.g., SWE-Bench Verified). These adjustments do not change the task itself.
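As a minimal sketch of what such an adaptation can look like, externalizing a prompt simply means passing text that used to be hard-coded in the benchmark harness into a shared task format, so every agent sees the same instruction and the task itself is untouched. The class and field names below are illustrative assumptions, not the actual Exgentic code.

```python
# Illustrative sketch only: BenchmarkTask and build_task are hypothetical
# names, not part of the Exgentic codebase.
from dataclasses import dataclass


@dataclass
class BenchmarkTask:
    task_id: str
    instruction: str   # task instructions required by the benchmark spec
    system_prompt: str # prompt externalized from the benchmark harness


def build_task(raw: dict, externalized_prompt: str) -> BenchmarkTask:
    """Wrap a raw benchmark record in a unified task format.

    The prompt that previously lived inside the benchmark code is passed in
    explicitly, so every agent receives identical text and nothing about the
    underlying task changes.
    """
    return BenchmarkTask(
        task_id=raw["id"],                   # hypothetical field names
        instruction=raw["problem_statement"],
        system_prompt=externalized_prompt,
    )
```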
We selected agents that represent commonly used general-agent architectures and that can operate across multiple benchmarks with minimal task-specific customization. Our goal is to evaluate general-purpose agents, not agents tailored to a single benchmark. We will add more agents, and we welcome contributions from the community.
Since Exgentic focuses on evaluating generality, we chose benchmarks that cover a diverse range of domains and task types, including coding, tool use, reasoning, and interactive environments.
We started with a small set of widely used models to establish the initial leaderboard. We plan to expand the leaderboard soon with many open-weight models as well as additional closed models.
Yes. Exgentic is fully open. The evaluation pipeline, agent implementations, and configuration are available in the repository so results can be reproduced. Minor variations may occur due to model version changes or nondeterminism in LLM outputs.
Everything is open. Jump in.
Clone the repo. Evaluate your agent against the field. Results in hours, not weeks.
Get the Code
Plug your evaluation environment into the Unified Protocol. Instant cross-agent testing.
Integrate
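As a rough sketch of what that integration can look like, an environment is wrapped behind a small reset/step/score interface that any agent in the harness can drive. The names here are illustrative assumptions, not the actual Unified Protocol API.

```python
# Hypothetical sketch of a unified-protocol adapter; the interface and the
# wrapped environment's methods are assumptions for illustration, not the
# actual Exgentic API.
from typing import Protocol


class UnifiedEnv(Protocol):
    def reset(self, task_id: str) -> str: ...             # initial observation
    def step(self, action: str) -> tuple[str, bool]: ...  # (observation, done)
    def score(self) -> float: ...                          # final task score


class MyEnvAdapter:
    """Wraps an existing evaluation environment behind the unified interface."""

    def __init__(self, env):
        self.env = env

    def reset(self, task_id: str) -> str:
        return self.env.load_task(task_id)    # hypothetical method on your env

    def step(self, action: str) -> tuple[str, bool]:
        obs = self.env.execute(action)        # hypothetical method on your env
        return obs, self.env.is_finished()

    def score(self) -> float:
        return self.env.final_score()
```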
Research updates, benchmarks, and findings from the Exgentic team.
Why pass/fail scores hide what matters most about agent systems.
How good are general-purpose AI agents? We built an open evaluation framework to find out.