SWE-bench

SWE-bench is the gold-standard benchmark for evaluating AI agents on real-world GitHub issues. Compare LLMs and coding agents on the public leaderboard.

About

SWE-bench is the leading evaluation framework for assessing how well AI-powered software engineering agents can autonomously resolve real-world GitHub issues. Developed by the SWE-bench Team and supported by institutions including Anthropic, OpenAI, AWS, and Open Philanthropy, the benchmark draws its tasks from popular open-source Python repositories and measures the percentage of issues an agent fully resolves.

The platform offers several benchmark variants to suit different evaluation needs: SWE-bench Full (2,294 instances), SWE-bench Verified (500 human-filtered instances), SWE-bench Lite (300 instances curated for cost-efficient evaluation), SWE-bench Multilingual (300 tasks across 9 programming languages), and SWE-bench Multimodal (issues involving visual elements). A public leaderboard allows side-by-side comparison of LLM agents, open-source scaffolds, and proprietary systems by resolve rate, cost, and step efficiency.

Beyond the leaderboard, the SWE-bench ecosystem includes tooling such as SWE-agent (the reference open-source agent), mini-SWE-agent, SWE-smith for training custom models, and a Dockerized evaluation harness for reproducible local testing. SWE-bench has become the de facto standard for measuring progress toward fully autonomous AI software development, making it indispensable for AI researchers, ML engineers, and anyone building or evaluating coding agents.
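
For a concrete sense of what a benchmark instance contains, the datasets are published on HuggingFace and can be loaded with the datasets library. A minimal sketch (the dataset identifier and field names below reflect the public SWE-bench Lite release; check the SWE-bench documentation for current details):

    from datasets import load_dataset

    # Load the 300-instance Lite subset; the Full and Verified sets are
    # published under similar identifiers on HuggingFace.
    lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

    example = lite[0]
    print(example["repo"])               # source repository for the issue
    print(example["problem_statement"])  # the original GitHub issue text
    print(example["patch"])              # the gold patch that resolved it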

Key Features

  • Multi-Variant Benchmarks: Offers Full, Verified, Lite, Multilingual, and Multimodal benchmark sets to support evaluation at different scales, costs, and modalities.
  • Public Leaderboard: Live, filterable leaderboard that ranks AI agents and LLMs by percentage of issues resolved, cost, and step efficiency.
  • Open-Source Evaluation Harness: Dockerized, reproducible evaluation environment and CLI tools so any team can run official SWE-bench evaluations locally (see the usage sketch after this list).
  • Multi-Language & Multimodal Coverage: SWE-bench Multilingual spans 9 programming languages; SWE-bench Multimodal includes issues with visual elements like screenshots and diagrams.
  • Agent Training Toolkit: SWE-smith lets researchers generate training data and train their own software engineering agents from scratch.
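
To illustrate the harness workflow mentioned above: an evaluation run consumes a JSON file of model-generated patches and replays each one inside Docker against the repository's test suite. A minimal sketch, assuming the swebench package from PyPI (the prediction fields and CLI flags follow its documented interface, but may differ between versions; the instance id shown is illustrative):

    import json

    # One prediction per attempted instance.
    predictions = [
        {
            "instance_id": "sympy__sympy-20590",
            "model_name_or_path": "my-agent",
            "model_patch": "diff --git a/sympy/... (the agent's proposed fix)",
        }
    ]

    with open("predictions.json", "w") as f:
        json.dump(predictions, f)

    # The Dockerized harness is then invoked from the command line, e.g.:
    #   python -m swebench.harness.run_evaluation \
    #       --dataset_name princeton-nlp/SWE-bench_Lite \
    #       --predictions_path predictions.json \
    #       --max_workers 8 \
    #       --run_id my-agent-lite-eval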

Use Cases

  • AI researchers benchmarking the software engineering capabilities of new large language models.
  • ML engineering teams comparing their custom coding agents against state-of-the-art systems on a public leaderboard.
  • Academic labs studying autonomous software engineering and agent-based code repair.
  • AI companies validating the real-world performance of proprietary models before release.
  • Developers training custom SWE agents using the SWE-smith data generation toolkit.

Pros

  • Industry-Standard Credibility: Backed by Anthropic, OpenAI, and leading research institutions, making SWE-bench results widely recognized across AI research.
  • Diverse Benchmark Variants: Multiple specialized subsets (Lite, Verified, Multilingual, Multimodal) allow teams to balance evaluation depth against cost and scope.
  • Fully Open and Reproducible: All datasets, tooling, and evaluation harnesses are open-source and Dockerized, ensuring transparent and reproducible evaluations.

Cons

  • Primarily Python-Focused (Full Benchmark): The original full benchmark centers on Python GitHub repositories, limiting coverage for other language ecosystems outside the Multilingual variant.
  • High Evaluation Cost: Running agents on the full 2,294-instance benchmark can be computationally expensive; teams often resort to the Lite subset to manage costs.

Frequently Asked Questions

What is SWE-bench?

SWE-bench is a benchmark that evaluates how well AI language models and agents can resolve real software engineering issues pulled from popular open-source GitHub repositories.

What is the difference between SWE-bench Full, Verified, and Lite?

SWE-bench Full contains 2,294 instances; Verified is a human-filtered subset of 500 high-quality instances; Lite is a curated 300-instance subset designed for faster and cheaper evaluation.
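
To see those sizes directly, the three variants can be loaded from HuggingFace and compared. A short sketch (the dataset identifiers are the publicly listed ones at the time of writing; verify them against the SWE-bench docs):

    from datasets import load_dataset

    # Compare the test-split sizes of the three main variants.
    for name in ("princeton-nlp/SWE-bench",
                 "princeton-nlp/SWE-bench_Verified",
                 "princeton-nlp/SWE-bench_Lite"):
        split = load_dataset(name, split="test")
        print(f"{name}: {len(split)} instances")
    # Expected output: 2294, 500, and 300 instances respectively.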

Is SWE-bench free to use?

Yes. SWE-bench is a free, open-source research benchmark. The datasets, leaderboard, and evaluation tooling are all publicly available on GitHub and HuggingFace.

How can I submit my agent to the SWE-bench leaderboard?

You can submit your agent's results via the official submission form on swebench.com. The leaderboard accepts results from both open-source and proprietary agent systems.

What is SWE-bench Multilingual?

SWE-bench Multilingual is a variant featuring 300 real-world software engineering tasks spanning 9 programming languages, extending evaluation beyond Python to broader language ecosystems.
