About
scGPT is the official implementation of the research paper 'scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI.' It uses a transformer-based generative architecture to learn biological representations from single-cell RNA sequencing and multi-omics data. The library provides multiple pretrained checkpoints trained on large corpora of single-cell data, which researchers can fine-tune or apply zero-shot to downstream tasks. Key capabilities include cell type annotation, gene regulatory network inference, perturbation response prediction, multi-batch integration, and zero-shot cell embedding for clustering and similarity tasks.
The repository includes Jupyter notebook tutorials covering both fine-tuning workflows and zero-shot applications, which makes it approachable for researchers with varying levels of machine learning expertise. scGPT also offers experimental integration with HuggingFace Transformers, extending its compatibility with the wider ML ecosystem. It is open source under the MIT license and actively maintained, with a community contributing via GitHub Issues and Discussions. Its primary audience is computational biologists, bioinformaticians, and AI researchers working on genomics, transcriptomics, and multi-omics problems.
Key Features
- Pretrained Foundation Model Checkpoints: Multiple pretrained scGPT checkpoints are available for diverse single-cell tasks, including a continual pretrained model optimized for cell embedding.
- Zero-Shot Applications: Supports zero-shot inference for cell clustering, embedding, and similarity tasks without requiring task-specific fine-tuning.
- Fine-Tuning for Downstream Tasks: Provides workflows for fine-tuning on tasks such as cell type annotation, gene regulatory network inference, and multi-batch data integration.
- HuggingFace Integration: Experimental support for running the pretraining workflow via HuggingFace Transformers, available on a separate branch and aimed at broader model ecosystem compatibility.
- Comprehensive Tutorials: Jupyter notebook tutorials cover end-to-end workflows for both fine-tuning and zero-shot use cases, lowering the barrier for computational biology researchers.
Use Cases
- Annotating cell types in single-cell RNA sequencing datasets using pretrained or fine-tuned scGPT models.
- Predicting gene perturbation responses to model the effects of genetic or chemical interventions at the single-cell level.
- Inferring gene regulatory networks from single-cell transcriptomics data.
- Integrating multi-batch single-cell datasets to remove batch effects and harmonize large-scale omics data.
- Generating zero-shot cell embeddings for unsupervised clustering, visualization, and cross-dataset comparison.
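To make the embedding-based use cases above concrete, here is a minimal sketch of how cell embeddings can drive similarity search and label transfer. It does not call scGPT: the three-dimensional vectors are hypothetical stand-ins for embeddings a pretrained checkpoint might produce, and the helper names `cosine_similarity` and `transfer_labels` are illustrative, not part of the scGPT API. The technique shown is plain k-nearest-neighbor voting over cosine similarity.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def transfer_labels(ref_embeddings, ref_labels, query_embeddings, k=3):
    """Give each query cell the majority label among its k most similar reference cells."""
    predictions = []
    for q in query_embeddings:
        # Rank reference cells from most to least similar to the query cell.
        ranked = sorted(
            range(len(ref_embeddings)),
            key=lambda i: cosine_similarity(q, ref_embeddings[i]),
            reverse=True,
        )
        top_labels = [ref_labels[i] for i in ranked[:k]]
        predictions.append(Counter(top_labels).most_common(1)[0][0])
    return predictions


# Hypothetical toy embeddings; real cell embeddings have many more dimensions.
ref_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.85, 0.15, 0.05],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
    [0.05, 0.85, 0.25],
]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"]
query_embeddings = [[0.7, 0.3, 0.1], [0.2, 0.7, 0.2]]

predicted = transfer_labels(ref_embeddings, ref_labels, query_embeddings, k=3)
print(predicted)  # ['T cell', 'B cell']
```

In practice the reference and query vectors would come from the same pretrained model, and a dedicated nearest-neighbor library would replace the brute-force ranking once the dataset grows beyond a few thousand cells.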
Pros
- Open Source with MIT License: Fully open source under the MIT license, which permits use, modification, and distribution in both academic and commercial contexts, provided the copyright and license notice are retained.
- Strong Research Foundation: Backed by a peer-reviewed publication and pretrained on large-scale single-cell data, providing high-quality biological representations out of the box.
- Active Community and Development: Actively maintained with regular updates, new checkpoints, and a growing community contributing via GitHub Issues and Discussions.
Cons
- Requires Computational Resources: Training and fine-tuning foundation models on large single-cell datasets demand significant GPU memory and compute, which may be a barrier for some researchers.
- Steep Learning Curve: Effective use requires familiarity with both deep learning concepts and single-cell bioinformatics, limiting accessibility for wet-lab biologists.
- HuggingFace Integration Still Experimental: The HuggingFace pretraining workflow is preliminary and not yet merged into the main branch, meaning it may be unstable for production use.
Frequently Asked Questions
What is scGPT?
scGPT is an open-source foundation model for single-cell multi-omics analysis. It uses generative AI and transformer architectures to learn biological representations from single-cell RNA sequencing data, enabling tasks like cell annotation, perturbation prediction, and gene network inference.
Can scGPT be used without fine-tuning?
Yes. scGPT supports zero-shot applications, allowing you to use pretrained checkpoints directly for tasks such as cell embedding, clustering, and similarity analysis without additional training.
Which pretrained checkpoints are available?
Multiple pretrained checkpoints are available, including general-purpose models and a continual pretrained model specifically optimized for cell embedding tasks. Details are provided in the README under the 'Pretrained scGPT checkpoints' section.
Does scGPT work with HuggingFace?
Preliminary HuggingFace integration is available on the 'integrate-huggingface-model' branch for running pretraining workflows, though it is still experimental and not yet merged into the main branch.
What license is scGPT released under?
scGPT is released under the MIT license, making it freely available for academic research, commercial use, and open-source contributions.
