About
The Open Catalyst Project (OCP) is an open-science initiative jointly developed by Meta AI's Fundamental AI Research (FAIR) team and Carnegie Mellon University's Department of Chemical Engineering. Its core mission is to use artificial intelligence to accelerate the discovery of cost-effective catalysts for renewable energy storage, a key bottleneck in addressing climate change. Evaluating a new catalyst structure traditionally requires expensive quantum mechanical simulations based on density functional theory (DFT). OCP trains machine learning models to approximate these calculations efficiently, so researchers can screen far more candidate structures than DFT alone would allow.
The project has released two landmark datasets: OC20, containing over 1.2 million molecular relaxations derived from more than 250 million DFT calculations, and OC22, focused on oxide electrocatalysis. A newer dataset, OpenDAC, targets Metal-Organic Frameworks (MOFs) for Direct Air Capture applications. All datasets, baseline ML models, and code are openly available on GitHub.
OCP also hosts competitive leaderboards and has run annual challenges at NeurIPS, inviting the global research community to submit and benchmark new models. An interactive demo lets users visualize molecular relaxation trajectories for various catalyst-adsorbate systems. The project is aimed at AI researchers, computational chemists, materials scientists, and academic institutions working at the intersection of machine learning and clean energy.
Key Features
- Large-Scale Open Datasets: Access the OC20, OC22, and OpenDAC datasets, freely available for ML model training. OC20 and OC22 together comprise about 1.3 million molecular relaxations from over 260 million DFT calculations.
- Open-Source Baseline ML Models: Pre-trained machine learning models and full source code are available on GitHub, enabling researchers to benchmark and build upon state-of-the-art catalyst prediction methods.
- Interactive Molecular Demo: A browser-based interactive demo lets users visualize relaxation trajectories of adsorbates on catalyst surfaces, making complex simulations accessible without local compute.
- Community Leaderboard & Challenges: A public leaderboard tracks model performance, and annual challenges held at NeurIPS bring the global research community together to push the state of the art.
- OpenDAC for Direct Air Capture: A dedicated dataset and models focused on Metal-Organic Frameworks (MOFs) for CO₂ Direct Air Capture, extending OCP's scope beyond electrocatalysis.
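To make the idea of a relaxation trajectory concrete, here is a minimal toy sketch. It is not OCP code. A single adsorbate moves along one coordinate (its height above a surface) by steepest descent on a Morse potential until the residual force drops below a threshold. The potential parameters, step size, force threshold, and the `relax` helper are all illustrative assumptions. Relaxations in the actual datasets are driven by DFT (or an ML model standing in for it) over all atomic positions.

```python
import math

# Illustrative Morse parameters (assumed values, not taken from any OCP dataset).
WELL_DEPTH = 0.5   # eV, depth of the binding well
WIDTH = 1.5        # 1/angstrom, controls how narrow the well is
Z_EQ = 1.8         # angstrom, equilibrium height above the surface


def energy(z):
    """Morse energy (eV) of the adsorbate at height z (angstrom)."""
    x = math.exp(-WIDTH * (z - Z_EQ))
    return WELL_DEPTH * (1.0 - x) ** 2 - WELL_DEPTH


def force(z):
    """Force (eV/angstrom) along z, i.e. the negative energy gradient."""
    x = math.exp(-WIDTH * (z - Z_EQ))
    return -2.0 * WELL_DEPTH * WIDTH * x * (1.0 - x)


def relax(z_start, step=0.2, fmax=0.03, max_steps=1000):
    """Steepest-descent relaxation; returns the trajectory as (z, energy) pairs."""
    z = z_start
    trajectory = [(z, energy(z))]
    for _ in range(max_steps):
        f = force(z)
        if abs(f) < fmax:      # converged: residual force is small
            break
        z += step * f          # move downhill along the force
        trajectory.append((z, energy(z)))
    return trajectory


trajectory = relax(z_start=3.0)
for i, (z, e) in enumerate(trajectory):
    if i % 5 == 0:
        print(f"frame {i:3d}: z = {z:.3f} A, E = {e:.4f} eV")
```

Each `(z, energy)` pair plays the role of one frame in a trajectory viewer: the adsorbate starts far from the surface and settles near the equilibrium height as the energy decreases.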
Use Cases
- Computational chemistry researchers training ML models to predict molecular adsorption energies and catalyst surface relaxations without expensive DFT calculations.
- Materials scientists screening large libraries of catalyst candidates for renewable energy applications such as hydrogen production and CO₂ reduction.
- AI/ML researchers benchmarking graph neural networks and other architectures on large-scale atomistic simulation datasets via the public leaderboard.
- Climate and energy research groups developing cost-effective catalysts for electrolysis, fuel cells, and direct air capture of carbon dioxide.
- University labs and academic institutions using OC20 and OC22 as benchmark datasets for graduate research in machine learning for molecular modeling.
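The adsorption-energy and screening use cases above can be sketched in a few lines. The adsorption energy is conventionally the energy of the relaxed surface plus adsorbate, minus the energy of the clean surface, minus a reference energy for the adsorbate. In the sketch below, the `Candidate` class, the surface names, and every number are made-up placeholders for illustration. Ranking by distance from a user-chosen target energy is only one possible screening criterion, assumed here for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str               # hypothetical label for a surface/adsorbate pair
    e_system: float         # eV, relaxed surface + adsorbate
    e_slab: float           # eV, relaxed clean surface
    e_adsorbate_ref: float  # eV, reference energy of the isolated adsorbate


def adsorption_energy(c):
    """E_ads = E(surface + adsorbate) - E(surface) - E(adsorbate reference)."""
    return c.e_system - c.e_slab - c.e_adsorbate_ref


def rank_by_target(candidates, target_ev):
    """Sort candidates by how close their adsorption energy is to the target."""
    return sorted(candidates, key=lambda c: abs(adsorption_energy(c) - target_ev))


# Placeholder energies; in practice these would come from DFT or an ML model.
candidates = [
    Candidate("surface_A", e_system=-105.30, e_slab=-100.00, e_adsorbate_ref=-4.80),
    Candidate("surface_B", e_system=-212.10, e_slab=-206.00, e_adsorbate_ref=-4.80),
    Candidate("surface_C", e_system=-98.95, e_slab=-94.00, e_adsorbate_ref=-4.80),
]

for c in rank_by_target(candidates, target_ev=-0.4):
    print(f"{c.name}: E_ads = {adsorption_energy(c):+.2f} eV")
```

Because each prediction from a trained model is far cheaper than a DFT relaxation, the same loop can in principle run over a much larger candidate library.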
Pros
- Completely Free and Open Source: All datasets, code, and baseline models are publicly available at no cost, lowering barriers for academic researchers and institutions worldwide.
- Massive, High-Quality Datasets: Over 260 million DFT calculations back the datasets, providing an unprecedented scale of ground-truth quantum chemistry data for ML training.
- Cross-Disciplinary Research Impact: Bridges AI and materials science, enabling chemists without deep ML expertise and ML researchers without chemistry backgrounds to collaborate effectively.
- Active Research Community: Regular dataset releases, NeurIPS challenges, and leaderboard submissions foster a vibrant, competitive research ecosystem around catalyst discovery.
Cons
- Highly Specialized Domain: The project is narrowly focused on catalysis and renewable energy; it has little applicability outside of computational chemistry and materials science research.
- Requires Technical Expertise: Effectively using the datasets and models requires familiarity with machine learning, density functional theory, and computational chemistry, which makes for a steep learning curve for newcomers.
- No Managed Cloud Service: There is no hosted inference or managed API; users must set up their own compute infrastructure to train or run models at scale.
Frequently Asked Questions
What is the Open Catalyst Project?
The Open Catalyst Project is a joint research initiative by Meta AI (FAIR) and Carnegie Mellon University that uses machine learning to approximate quantum mechanical simulations, enabling faster discovery of catalysts for renewable energy storage.
Is the Open Catalyst Project free to use?
Yes. All datasets (OC20, OC22, OpenDAC), baseline models, and code are released openly and are free to download, use, and build upon for research purposes.
What is OC20?
OC20 (Open Catalyst 2020) is a large-scale dataset containing over 1.2 million molecular relaxations backed by more than 250 million DFT calculations, designed for training ML models to predict catalyst behavior.
What is OpenDAC?
OpenDAC is a newer dataset and set of models released in 2023, focused on Metal-Organic Frameworks (MOFs) for Direct Air Capture (DAC) of CO₂, extending the project's scope to carbon removal technologies.
How can researchers participate in the leaderboard and challenges?
Researchers can submit model predictions to the public evaluation server and track results on the leaderboard. Annual challenges are hosted at NeurIPS; check the project website and GitHub for submission guidelines and deadlines.
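Before submitting to a leaderboard, it can help to score predictions locally against held-out reference energies. The sketch below is an assumption-laden illustration, not the evaluation server's code: this page does not state which metrics or thresholds the server uses. It shows two common choices, mean absolute error and the fraction of predictions within a tolerance (0.02 eV here, an assumed value). All energies are placeholder numbers.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between predicted and reference energies (eV)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def fraction_within_threshold(predicted, reference, threshold_ev=0.02):
    """Share of predictions whose absolute error is at most threshold_ev."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, r in zip(predicted, reference) if abs(p - r) <= threshold_ev)
    return hits / len(predicted)


# Placeholder energies in eV; real values would come from DFT and a trained model.
reference = [-0.50, -1.30, -0.15, 0.20]
predicted = [-0.49, -1.10, -0.16, 0.35]

print(f"MAE: {mean_absolute_error(predicted, reference):.4f} eV")
print(f"within 0.02 eV: {fraction_within_threshold(predicted, reference):.0%}")
```

The two numbers answer different questions: MAE summarizes typical error size, while the within-threshold fraction shows how often a prediction is accurate enough to stand in for the reference calculation.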
