PrimateAI

PrimateAI is an open-source deep residual neural network by Illumina that scores the pathogenicity of missense variants using primate sequence data.

About

PrimateAI is a deep residual neural network created by Illumina for classifying the pathogenicity of missense mutations in genomic research. It uses a semi-supervised training approach on approximately 380,000 common missense variants from humans and six non-human primate species, treating common variants in these species as likely benign.

The model takes as input the amino acid sequence flanking a variant of interest along with orthologous sequence alignments across species; no additional human-engineered features are required. The output is a continuous pathogenicity score from 0 (less pathogenic) to 1 (more pathogenic), enabling researchers to prioritize potentially disease-causing variants. To incorporate protein structural information, PrimateAI includes sub-networks that learn to predict secondary structure and solvent accessibility directly from amino acid sequences. This design allows the model to capture biologically meaningful context without requiring external structural databases.

PrimateAI is particularly useful for clinical genomics, rare disease diagnosis, and variant prioritization pipelines. It is designed for bioinformaticians, computational biologists, and genomics researchers working with whole-genome or whole-exome sequencing data. The repository is archived and read-only as of April 2026, but the model and code remain publicly accessible for research use under Illumina's open-source license.

Key Features

Primate-Trained Pathogenicity Scoring: Trained on ~380,000 missense variants from humans and six non-human primate species using a semi-supervised benign vs. unlabeled learning strategy.
Sequence-Only Input: Accepts amino acid sequences flanking the variant and cross-species orthologous alignments — no manually engineered features needed.
Integrated Protein Structure Sub-Networks: Predicts secondary structure and solvent accessibility from sequence data, embedding structural context directly into the pathogenicity model.
Continuous Pathogenicity Score: Outputs a score from 0 (benign) to 1 (pathogenic), enabling fine-grained variant prioritization for downstream clinical or research analysis.
Open-Source & Reproducible: Fully open-source under Illumina's license on GitHub, enabling academic and clinical researchers to reproduce results and integrate the model into custom pipelines.
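To make the sequence-only input concrete, the following is a minimal sketch of how a fixed-length window of flanking residues might be cut around a variant position. It is not taken from the PrimateAI code base: the function names, the padding character, and the default flank of 25 residues are all assumptions chosen for illustration.

```python
def flanking_window(protein: str, pos: int, flank: int = 25, pad: str = "-") -> str:
    """Return the residues around 1-based position `pos`, padded to 2*flank + 1.

    Hypothetical helper: window size and padding symbol are illustrative only.
    """
    if not 1 <= pos <= len(protein):
        raise ValueError("variant position lies outside the protein")
    idx = pos - 1  # convert to 0-based index
    left = protein[max(0, idx - flank):idx]
    right = protein[idx + 1:idx + 1 + flank]
    # Pad on the outer edges so the variant residue always sits in the centre.
    return left.rjust(flank, pad) + protein[idx] + right.ljust(flank, pad)


def apply_missense(window: str, ref_aa: str, alt_aa: str) -> str:
    """Substitute the central residue, checking it matches the reference."""
    center = len(window) // 2
    if window[center] != ref_aa:
        raise ValueError("reference residue does not match the sequence")
    return window[:center] + alt_aa + window[center + 1:]


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    ref_window = flanking_window("MKTAYIAKQR", pos=3, flank=4)
    alt_window = apply_missense(ref_window, ref_aa="T", alt_aa="S")
    print(ref_window)
    print(alt_window)
```

A real pipeline would pair each window with the corresponding cross-species alignment columns; that step is omitted here because its format is not described above.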

Use Cases

Prioritizing candidate pathogenic variants in whole-exome or whole-genome sequencing studies for rare disease diagnosis.
Filtering and ranking missense variants in clinical genomics pipelines to focus on high-impact mutations.
Benchmarking and comparing variant effect predictors in computational biology research.
Training or fine-tuning downstream models using PrimateAI scores as features or labels.
Supporting variant interpretation workflows in academic research on protein function and disease mechanisms.
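The filtering and ranking use cases above can be sketched in a few lines, assuming the scores have already been loaded into memory. The `ScoredVariant` record, the variant identifiers, and the cutoff value are hypothetical; the cutoff in particular is a placeholder and not a recommended threshold.

```python
from typing import Iterable, List, NamedTuple


class ScoredVariant(NamedTuple):
    """Hypothetical record pairing a variant identifier with its score."""
    variant_id: str
    score: float  # 0 (less pathogenic) to 1 (more pathogenic)


def prioritize(variants: Iterable[ScoredVariant], min_score: float) -> List[ScoredVariant]:
    """Keep variants scoring at or above `min_score`, highest score first."""
    if not 0.0 <= min_score <= 1.0:
        raise ValueError("min_score must lie in [0, 1]")
    kept = [v for v in variants if v.score >= min_score]
    return sorted(kept, key=lambda v: v.score, reverse=True)


if __name__ == "__main__":
    # Made-up identifiers and scores, for illustration only.
    candidates = [
        ScoredVariant("variant_A", 0.91),
        ScoredVariant("variant_B", 0.12),
        ScoredVariant("variant_C", 0.67),
    ]
    for hit in prioritize(candidates, min_score=0.5):
        print(hit.variant_id, hit.score)
```

Because the score is continuous, the cutoff is left as a required argument so that each pipeline chooses it explicitly rather than inheriting a default.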

Pros

No Feature Engineering Required: The model learns directly from raw sequence data, eliminating the need for manually curated biological features and reducing implementation complexity.
Large & Diverse Training Dataset: Trained on hundreds of thousands of variants across multiple primate species, providing strong signal for distinguishing pathogenic from benign variants.
Embeds Structural Biology: Built-in sub-networks for secondary structure and solvent accessibility give the model biologically grounded context without external structural data dependencies.
Peer-Reviewed Research Tool: Developed by Illumina and described in a peer-reviewed publication (Sundaram et al., Nature Genetics, 2018), so the method and its evaluation are documented rather than resting on the developer's reputation alone.

Cons

Repository Archived: As of April 2026, the GitHub repository is archived and read-only, meaning no new updates, bug fixes, or community contributions will be accepted.
Focused Scope: Designed specifically for missense mutation pathogenicity; it does not cover other variant types such as insertions, deletions, or splice site mutations.
Requires Bioinformatics Expertise: Intended for researchers with computational biology backgrounds; non-technical users may find setup and integration into genomics pipelines challenging.

Frequently Asked Questions

What types of variants does PrimateAI score?
PrimateAI is specifically designed for missense mutations, that is, single amino acid changes in a protein sequence. It does not natively handle indels, nonsense, or splice variants.

How should the output score be interpreted?
The model outputs a score between 0 and 1. Scores closer to 1 indicate a higher likelihood that the variant is pathogenic (disease-causing), while scores closer to 0 suggest the variant is more likely benign.

Does PrimateAI require external protein structure data?
No. PrimateAI learns to predict secondary structure and solvent accessibility directly from amino acid sequences using internal sub-networks, so no external protein structure databases are required.

Is the repository still actively maintained?
No. The repository was archived by Illumina on April 20, 2026, and is now read-only. The code and model weights remain publicly available for use but will not receive further updates.

How was PrimateAI trained?
It was trained using a semi-supervised approach on approximately 380,000 common missense variants from humans and six non-human primate species. Common variants in these species were treated as likely benign, providing a large natural training signal without requiring extensive labeled pathogenic data.
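The benign-versus-unlabeled setup described above can be illustrated with a small sketch of how training labels might be assembled. This is an assumption-laden toy, not Illumina's procedure: the function name, the one-to-one sampling ratio, and the string identifiers are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple

LIKELY_BENIGN = 0  # common variants seen in humans or other primates
UNLABELED = 1      # variants with no label; pathogenicity unknown


def build_training_set(
    common_variants: Sequence[str],
    unlabeled_pool: Sequence[str],
    seed: int = 0,
) -> List[Tuple[str, int]]:
    """Pair likely-benign variants with an equal-sized unlabeled sample.

    The 1:1 ratio is an illustrative choice, not a documented setting.
    """
    rng = random.Random(seed)
    k = min(len(common_variants), len(unlabeled_pool))
    sampled = rng.sample(list(unlabeled_pool), k)
    examples = [(v, LIKELY_BENIGN) for v in common_variants]
    examples += [(v, UNLABELED) for v in sampled]
    rng.shuffle(examples)  # mix the two classes before batching
    return examples


if __name__ == "__main__":
    # Placeholder identifiers standing in for real variants.
    data = build_training_set(["c1", "c2"], ["u1", "u2", "u3", "u4", "u5"])
    print(data)
```

The point of the sketch is that no variant is ever labeled pathogenic: a classifier trained this way learns to separate variants tolerated at common frequency from the unlabeled background.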