MT Bench

MT Bench is an open-source multi-turn benchmark for evaluating large language models using GPT-4 as an automated judge. Part of the FastChat ecosystem by lm-sys.

About

MT Bench (Multi-Turn Benchmark) is an open-source evaluation toolkit developed by the lm-sys team as part of the FastChat ecosystem. It provides a curated set of challenging multi-turn, open-ended questions designed to evaluate the conversational quality and reasoning abilities of LLM-powered chat assistants. Rather than relying on simple automated metrics, MT Bench uses the LLM-as-a-judge paradigm: a strong model such as GPT-4 is prompted to assess and score model responses, enabling nuanced quality measurement at scale. The benchmark spans diverse categories, including reasoning, math, coding, STEM, writing, and roleplay.

With MT Bench, researchers and developers can generate model answers via API or local inference, run GPT-4-based judgments, review pre-generated answers from popular models, compute inter-annotator agreement, and visualize results through an interactive browser UI. The benchmark has become a de facto standard in the open-source LLM community for comparing model capabilities, and its leaderboard is widely referenced in academic papers and model releases. It is particularly useful for AI researchers, ML engineers, and teams developing or fine-tuning chat-based language models who need reproducible, automated evaluation pipelines.

Key Features

  • Multi-Turn Question Set: A curated set of challenging multi-turn, open-ended questions across diverse categories including reasoning, math, coding, writing, and STEM.
  • LLM-as-a-Judge Scoring: Automates evaluation by prompting GPT-4 (or other strong LLMs) to act as judges and produce quality scores for model responses.
  • Pre-Generated Model Answers: Provides downloadable pre-generated answers from popular LLMs so users can review and compare baseline model performance instantly.
  • Agreement Computation: Includes tools to compute inter-annotator agreement between human raters and LLM judges for deeper reliability analysis.
  • Interactive Results Browser: A QA browser UI lets users explore model answers, judgments, and benchmark results in a visual, interactive interface.
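To make the question set concrete: each MT Bench question is a JSON object on its own line of a JSONL file, with a question ID, a category, and a list of conversation turns. The loader below is an illustrative sketch with made-up sample questions; the authoritative schema and the real question file live in the FastChat repository.

```python
import json

# Two hypothetical MT Bench-style questions, inlined for illustration.
# Real questions ship as a JSONL file in the FastChat repository.
SAMPLE = "\n".join([
    json.dumps({"question_id": 81, "category": "writing",
                "turns": ["Compose a travel blog post about a recent trip.",
                          "Rewrite your previous response as a poem."]}),
    json.dumps({"question_id": 101, "category": "reasoning",
                "turns": ["If you overtake the second runner, what position are you in?",
                          "What if you overtake the last runner instead?"]}),
])

def load_questions(text: str) -> list[dict]:
    """Parse one multi-turn question per JSONL line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

questions = load_questions(SAMPLE)

# Group question IDs by category, mirroring how per-category scores
# are typically reported.
by_category: dict[str, list[int]] = {}
for q in questions:
    by_category.setdefault(q["category"], []).append(q["question_id"])

print(len(questions))        # number of questions loaded
print(sorted(by_category))   # categories present
```

The multi-turn structure matters for evaluation: the model must answer the second turn in the context of its own first-turn response, which is what distinguishes MT Bench from single-turn benchmarks.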

Use Cases

  • Researchers benchmarking new open-source language models against existing baselines to report results in academic papers.
  • ML engineers fine-tuning chat models who need automated, reproducible evaluation pipelines to track improvement across training runs.
  • AI teams comparing proprietary and open-source LLMs on conversational quality before selecting a model for production deployment.
  • Organizations running internal LLM evaluations to measure how well custom fine-tuned models perform on multi-turn dialogue tasks.
  • Academics studying inter-annotator agreement and the reliability of LLM-as-a-judge evaluation methodologies.
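For the agreement-focused use cases above, the core computation can be as simple as the fraction of items on which two raters (say, a human annotator and a GPT-4 judge) give the same preference label. The sketch below uses generic "A" / "B" / "tie" labels and is not MT Bench's exact implementation, just a minimal illustration of the idea, including the common variant that excludes ties.

```python
# Illustrative agreement computation between two raters (e.g. a human
# and an LLM judge) over pairwise preference labels "A", "B", "tie".
# This is a generic sketch, not MT Bench's exact metric.

def agreement(labels_a: list[str], labels_b: list[str],
              include_ties: bool = True) -> float:
    """Fraction of items on which the two raters give the same label."""
    assert len(labels_a) == len(labels_b)
    pairs = list(zip(labels_a, labels_b))
    if not include_ties:
        # Restrict to items where neither rater said "tie".
        pairs = [(x, y) for x, y in pairs if x != "tie" and y != "tie"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

human = ["A", "B", "tie", "A", "B"]
judge = ["A", "B", "A",   "A", "A"]
print(agreement(human, judge))                      # 0.6
print(agreement(human, judge, include_ties=False))  # 0.75
```

Reporting both variants is useful because ties are often where human raters and LLM judges disagree most.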

Pros

  • Automated & Scalable Evaluation: Replaces slow and expensive human annotation with GPT-4-based judging, enabling fast and reproducible LLM benchmarking at scale.
  • Comprehensive Multi-Turn Coverage: Tests models across diverse, realistic multi-turn scenarios rather than simple single-turn tasks, offering a more complete assessment of chat quality.
  • Open Source & Community-Adopted: Freely available on GitHub and widely used in academic and industry research, making results directly comparable to published leaderboards.
  • Flexible Model Integration: Supports both API-based models and locally served models, making it adaptable to a wide range of deployment setups.

Cons

  • Requires GPT-4 API Access for Judging: The LLM-as-a-judge pipeline depends on access to GPT-4, which incurs API costs and introduces a dependency on OpenAI's infrastructure.
  • Limited to Text-Based Chat Tasks: MT Bench focuses exclusively on chat assistant evaluation and does not cover multimodal, retrieval-augmented, or tool-use scenarios.
  • Technical Setup Required: Running MT Bench requires familiarity with Python, CLI tools, and model serving infrastructure, making it inaccessible to non-technical users.

Frequently Asked Questions

What is MT Bench?

MT Bench is a multi-turn open-ended question benchmark developed by the lm-sys team to evaluate the quality of LLM-powered chat assistants using GPT-4 as an automated judge.

How does the LLM-as-a-judge approach work?

Instead of human annotators, MT Bench prompts a strong LLM like GPT-4 to read model responses and assign quality scores, making evaluation faster, cheaper, and reproducible.

Is MT Bench free to use?

Yes, MT Bench is fully open source and available on GitHub at no cost. However, running GPT-4 as a judge requires an OpenAI API key and will incur API usage costs.

Which models can be evaluated with MT Bench?

MT Bench supports any model that can be served via a compatible API or locally using FastChat, including Vicuna, LLaMA, GPT series, and other open or closed-source models.

Where can I find pre-generated benchmark results?

The repository provides a download script for pre-generated model answers and judgments, and results are published on the Chatbot Arena leaderboard maintained by lm-sys.
