About
METR (Model Evaluation & Threat Research) is an independent nonprofit dedicated to evaluating the capabilities and risks of frontier AI systems. Their mission is to advance public understanding of AI through rigorous, third-party assessments that help guide the safe deployment of increasingly powerful models.

METR is best known for pioneering the 'time horizon' metric, a measure of the length of software tasks an AI agent can reliably complete, which they have tracked across models released from 2019 to the present, demonstrating consistent exponential growth. They publish detailed evaluation reports for models including GPT-5, Claude 3.7 Sonnet, DeepSeek-R1, and OpenAI o3, often in formal partnership with AI developers. Key research areas include autonomous capability assessments, AI's ability to accelerate AI R&D, evaluation integrity (e.g., sandbagging and reward hacking, studied via the MALT dataset), developer productivity impacts, and frontier AI safety policy analysis. METR also maintains a comprehensive index of research and guidance on autonomous AI capabilities.

METR's work is freely available and intended for AI researchers, policymakers, AI safety professionals, and enterprises seeking independent assessments of AI risk. Because METR does not accept compensation for evaluations, their assessments stay independent of the developers they evaluate, making them a foundational resource for anyone tracking the trajectory of frontier AI capabilities and safety.
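To make the metric concrete, here is a minimal sketch of how a 50% time horizon could be estimated, assuming per-task results paired with human baseline completion times: fit a logistic curve to agent success as a function of log task length, then read off the length at which predicted success crosses 50%. The data, the `success_curve` helper, and the least-squares fit are illustrative assumptions, not METR's actual pipeline, which is documented in their published methodology.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical per-task results: human baseline completion time (minutes)
# and whether the agent succeeded. METR's real task suites are versioned
# and much larger; these numbers are for illustration only.
baseline_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
agent_succeeded = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

def success_curve(log_t, log_h50, slope):
    # P(success) declines with log task length; log_h50 is the
    # log-duration at which predicted success crosses 50%.
    return 1.0 / (1.0 + np.exp(slope * (log_t - log_h50)))

# Least-squares fit (a real analysis would use maximum likelihood).
params, _ = curve_fit(success_curve, np.log(baseline_minutes),
                      agent_succeeded, p0=[np.log(30.0), 1.0])
log_h50, slope = params

print(f"Estimated 50% time horizon: {np.exp(log_h50):.0f} minutes")
```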
Key Features
- Time Horizon Metric: Measures AI performance by the length of software tasks agents can reliably complete, tracking exponential growth across models released from 2019 to the present (a sketch of the doubling-time calculation follows this list).
- Frontier Model Evaluation Reports: Publishes detailed autonomous capability assessments for leading models (GPT-5, Claude 3.7 Sonnet, o3, DeepSeek-R1), often in formal partnership with AI developers.
- MALT Dataset: A curated dataset of natural and prompted examples of behaviors that threaten evaluation integrity, including reward hacking and sandbagging.
- AI Safety Policy Analysis: Analyzes shared components across published frontier AI safety policies, covering capability thresholds, model weight security, and deployment mitigations.
- Developer Productivity Research: Empirical studies measuring AI's real-world impact, such as a randomized trial finding that early-2025 AI tools made experienced open-source developers 19% slower on average.
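As referenced in the Time Horizon Metric item above, the exponential-growth claim reduces to a straight-line fit in log space: regress log2 of each model's measured horizon on its release date, and the inverse slope is the doubling time. The year/horizon pairs below are invented placeholders, not METR's published measurements.

```python
import numpy as np

# Hypothetical (release year, 50% horizon in minutes) pairs for a
# series of frontier models; placeholders, not METR's published data.
years = np.array([2019.5, 2021.0, 2022.5, 2024.0, 2025.0])
horizons = np.array([0.05, 0.5, 4.0, 30.0, 120.0])

# Exponential growth is a straight line in log2 space:
# log2(horizon) = a * year + b, so the doubling time is 1/a years.
a, b = np.polyfit(years, np.log2(horizons), deg=1)

print(f"Implied doubling time: {12.0 / a:.1f} months")
```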
Use Cases
- AI safety researchers tracking the growth of autonomous AI capabilities over time using the time horizon metric.
- Policy makers and regulators reviewing frontier AI safety policies and risk transparency standards.
- AI developers benchmarking their models against third-party capability evaluations for safety assessments.
- Enterprises and institutions needing independent, unbiased assessments of AI risk before deployment decisions.
- Academics studying evaluation integrity issues such as sandbagging and reward hacking using the MALT dataset.
Pros
- Truly Independent: METR does not accept compensation for evaluations, ensuring unbiased, third-party assessments of frontier AI capabilities.
- Freely Available Research: All evaluation reports, datasets, and research papers are publicly accessible at no cost, benefiting the broader AI safety community.
- Rigorous Methodology: Evaluations use systematic, reproducible methods with versioned task suites, enabling longitudinal tracking of AI capability growth.
Cons
- Not a Commercial Tool: METR is a research organization, not a software product — there is no API or platform to integrate into existing workflows.
- Narrow Scope: Focus is exclusively on frontier AI evaluation and safety research, so it offers limited utility for practitioners outside AI safety or policy domains.
Frequently Asked Questions
What is METR?
METR (Model Evaluation & Threat Research) is an independent nonprofit that evaluates the autonomous capabilities and risks of frontier AI systems, publishing research and evaluation reports for the public.

What is the time horizon metric?
The time horizon metric measures AI performance in terms of the length of software tasks an AI agent can reliably complete. METR has measured it across models released from 2019 to the present and found consistent exponential growth.

Does METR charge for its evaluations?
No. METR explicitly does not accept compensation for evaluation work, ensuring complete independence from the AI developers whose models they assess.

Which models has METR evaluated?
METR has published evaluation reports for models including GPT-5, GPT-4o, Claude 3.7 Sonnet, Claude 3.5 Sonnet, OpenAI o3, o4-mini, o1-preview, DeepSeek-R1, DeepSeek-V3, and others.

Who is METR's research intended for?
METR's research is intended for AI safety researchers, policymakers, AI developers, and enterprises seeking independent, rigorous assessments of frontier AI capabilities and risks.
