About
VoxCeleb is an open-source audio-visual dataset developed by the Visual Geometry Group (VGG) at the University of Oxford. It consists of two versions: VoxCeleb1, with over 150,000 utterances from 1,251 celebrities, and VoxCeleb2, with over 1,000,000 utterances from 6,112 celebrities. The data was collected entirely from YouTube interview videos using a fully automated pipeline that performs active speaker verification with a two-stream synchronization CNN and confirms identity with facial recognition.

All speech samples are captured 'in the wild': they include background chatter, laughter, overlapping speech, pose variation, and varying lighting conditions, making the dataset highly representative of real-world audio challenges. Speakers span a wide range of ethnicities, accents, professions, ages, and nationalities. Each audio segment is at least 3 seconds long, and the dataset totals over 2,000 hours of content.

VoxCeleb is primarily used for training and benchmarking speaker recognition, speaker verification, and speaker identification models. It has enabled significant advances in deep learning-based audio research and serves as a standard benchmark in the speaker recognition community. Researchers, data scientists, and AI engineers working on voice biometrics, speaker diarization, and audio-visual learning will find this dataset invaluable.
Key Features
- Massive Scale: Over 1 million utterances from 7,000+ speakers across VoxCeleb1 and VoxCeleb2, totaling more than 2,000 hours of audio-visual content.
- In-the-Wild Conditions: All speech segments are captured under real-world noisy conditions including background chatter, laughter, overlapping speech, and lighting variation.
- Automated Data Pipeline: Built with a fully automated pipeline combining active speaker verification CNNs and facial recognition, eliminating the need for manual annotation.
- Diverse Speaker Demographics: Speakers span a wide range of ethnicities, accents, professions, ages, and nationalities, enabling robust and generalizable model training.
- Dual Dataset Versions: VoxCeleb1 (150K+ utterances, 1,251 speakers) and VoxCeleb2 (1M+ utterances, 6,112 speakers) offer scalable options for different research needs.
Use Cases
- Training deep learning models for speaker verification and authentication systems.
- Benchmarking and comparing speaker recognition algorithms under noisy, real-world conditions.
- Researching audio-visual speech processing and multi-modal learning techniques.
- Developing voice biometrics systems for security and identity verification applications.
- Studying speaker diarization to identify and separate multiple speakers in audio recordings.
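At its core, the speaker verification task listed above reduces to comparing fixed-length speaker embeddings and thresholding their similarity. A minimal sketch of that scoring step, assuming embeddings have already been extracted by some upstream model (the vectors and the 0.7 threshold below are illustrative placeholders, not values from VoxCeleb):

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def same_speaker(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.7) -> bool:
    """Accept the verification trial if similarity exceeds a tuned threshold.

    In practice the threshold is chosen on a development set to hit a target
    operating point (e.g. the equal error rate); 0.7 here is a placeholder.
    """
    return cosine_score(emb_a, emb_b) >= threshold

# Illustrative 4-dim embeddings (real systems typically use hundreds of dims).
enroll = np.array([0.1, 0.9, 0.3, 0.2])
test_same = np.array([0.12, 0.85, 0.33, 0.18])  # near-duplicate of enroll
test_diff = np.array([-0.9, 0.1, -0.2, 0.4])    # unrelated direction
```

Benchmarks on VoxCeleb trial lists are usually reported as equal error rate over many such accept/reject decisions.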
Pros
- Freely Available: The dataset is open and freely accessible for academic and research use, lowering barriers to entry for speaker recognition research.
- Real-World Diversity: Collected from YouTube in unconstrained conditions, making models trained on it more robust to real-world audio challenges.
- Extensively Cited Benchmark: VoxCeleb is a widely adopted benchmark in the speaker recognition community, enabling fair comparison across research papers and models.
Cons
- Privacy Concerns: As data is sourced from public videos of real celebrities, there are inherent privacy considerations — a Dataset Privacy Notice is provided but restrictions may apply.
- Limited to Celebrity Speakers: All speakers are public figures/celebrities from YouTube, which may introduce demographic bias and limit generalization to non-celebrity voice populations.
- No Built-In Tooling: The dataset is a raw data resource with no built-in processing tools or APIs, requiring users to implement their own data loading and preprocessing pipelines.
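Since no tooling ships with the dataset, a typical first step is to index the raw files by speaker. A minimal sketch, assuming the commonly documented `speaker_id/video_id/utterance` directory layout (VoxCeleb2 distributes audio as `.m4a` rather than `.wav`, so the suffix argument may need adjusting):

```python
from pathlib import Path
from typing import NamedTuple

class Utterance(NamedTuple):
    speaker_id: str  # e.g. an "id…" directory name
    video_id: str    # source YouTube video the clip was cut from
    path: Path       # full path to the audio file

def parse_utterance(path: Path, root: Path) -> Utterance:
    """Recover speaker and video IDs from an utterance's path under root."""
    speaker_id, video_id = path.relative_to(root).parts[:2]
    return Utterance(speaker_id, video_id, path)

def index_dataset(root: Path, suffix: str = ".wav") -> list[Utterance]:
    """Walk the dataset root and collect every utterance file in sorted order."""
    return [parse_utterance(p, root) for p in sorted(root.rglob(f"*{suffix}"))]
```

An index like this can then back a training-data loader or be grouped by `speaker_id` to build verification trial pairs.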
Frequently Asked Questions
What is VoxCeleb used for?
VoxCeleb is primarily used for training and evaluating speaker recognition, speaker verification, and speaker identification models in deep learning research.

What is the difference between VoxCeleb1 and VoxCeleb2?
VoxCeleb1 contains over 150,000 utterances from 1,251 speakers, while VoxCeleb2 is significantly larger with over 1,000,000 utterances from 6,112 speakers.

Is VoxCeleb free to use?
Yes, VoxCeleb is freely available for academic and research use. Users should consult the Dataset Privacy Notice for any restrictions.

How was the dataset collected?
The dataset was built using a fully automated pipeline that downloads interview videos from YouTube, verifies active speakers using a synchronization CNN, and confirms identities using facial recognition.

How should I cite VoxCeleb?
You should cite the relevant publications: Nagrani et al. (2019) for the general dataset overview, Chung et al. (2018) for VoxCeleb2, and Nagrani et al. (2017) for the original VoxCeleb1 INTERSPEECH paper.
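For convenience, BibTeX entries along the following lines can be used. The titles and venues match the published papers, but fields such as page numbers are omitted, so the entries should be verified against the official versions before use:

```bibtex
@inproceedings{Nagrani17,
  author    = {Arsha Nagrani and Joon Son Chung and Andrew Zisserman},
  title     = {VoxCeleb: A Large-Scale Speaker Identification Dataset},
  booktitle = {INTERSPEECH},
  year      = {2017}
}

@inproceedings{Chung18,
  author    = {Joon Son Chung and Arsha Nagrani and Andrew Zisserman},
  title     = {VoxCeleb2: Deep Speaker Recognition},
  booktitle = {INTERSPEECH},
  year      = {2018}
}

@article{Nagrani19,
  author  = {Arsha Nagrani and Joon Son Chung and Weidi Xie and Andrew Zisserman},
  title   = {VoxCeleb: Large-Scale Speaker Verification in the Wild},
  journal = {Computer Speech and Language},
  year    = {2019}
}
```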