About
Medaka is a bioinformatics tool from Oxford Nanopore Technologies (ONT) Research that generates consensus sequences and variant calls from nanopore sequencing data. It applies neural networks to pileups of individual sequencing reads aligned against a reference sequence (typically a draft assembly or a database reference) to correct sequencing errors and call variants. The models are trained for the error profile of ONT long reads, which is why Medaka is commonly run as a post-processing step in nanopore bioinformatics pipelines.

According to its developers, Medaka outperforms sequence-graph-based methods (such as Racon) and signal-based approaches in accuracy while also being faster, which makes it practical even for large genomes. Key capabilities include consensus polishing of draft assemblies, haploid and diploid variant calling, and support for multiple nanopore chemistries and basecallers.

Medaka is distributed as a Python package and supports GPU acceleration for faster inference. It is used in research settings for genome assembly, metagenomics, and clinical sequencing applications, and is aimed at computational biologists, genomics researchers, and bioinformaticians who need consensus polishing and variant calling for Oxford Nanopore data.
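To make the term "pileup" concrete, here is a toy sketch in Python. It is not Medaka's algorithm: it stacks a few pre-aligned reads, counts the symbols in each column, and takes a plain majority vote. Medaka instead feeds pileup-derived features to a trained neural network. The function names and example reads are hypothetical, and the reads are assumed to be already aligned to the draft, with `-` marking a deletion.

```python
from collections import Counter


def pileup_columns(aligned_reads):
    """Yield a Counter of symbol counts for each draft position.

    Assumes every read is an equal-length string already aligned to the
    draft, with '-' marking a deletion relative to the draft.
    """
    for column in zip(*aligned_reads):
        yield Counter(column)


def majority_consensus(aligned_reads):
    """Toy stand-in for the neural network: most common symbol per column."""
    calls = (counts.most_common(1)[0][0] for counts in pileup_columns(aligned_reads))
    # Drop columns where the majority of reads support a deletion.
    return "".join(symbol for symbol in calls if symbol != "-")


# Four hypothetical reads; each contains at most one error.
reads = [
    "ACGTTAGC",
    "ACGTAAGC",  # substitution at position 4
    "ACG-TAGC",  # deletion at position 3
    "ACGTTAGC",
]
consensus = majority_consensus(reads)
print(consensus)
```

A majority vote breaks down when errors are systematic, for example in homopolymer runs where most reads share the same mistake. That is the situation a model trained on nanopore error patterns is meant to handle.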
Key Features
- Neural Network Consensus Polishing: Uses deep learning models applied to read pileups to correct errors in draft assemblies and produce high-accuracy consensus sequences.
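- Variant Calling: Supports haploid variant calling directly from nanopore reads aligned to a reference sequence, and has historically offered a diploid workflow. Recent releases deprecate the diploid workflow, and ONT's documentation points users to Clair3 for diploid calling.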
- State-of-the-Art Accuracy: Reported by its developers to outperform sequence-graph-based methods (e.g., Racon) and signal-based approaches in consensus accuracy benchmarks.
- GPU Acceleration: Leverages GPU hardware for faster neural network inference, enabling practical use on large genomes and datasets.
- Broad Nanopore Compatibility: Compatible with multiple ONT sequencing chemistries and basecallers, fitting into diverse nanopore bioinformatics pipelines.
Use Cases
- Polishing draft genome assemblies generated from Oxford Nanopore long reads to produce near-reference-quality consensus sequences.
- Calling single nucleotide variants (SNVs) and small indels from nanopore sequencing data in haploid or diploid organisms.
- Improving metagenomic assemblies by correcting consensus sequences from mixed nanopore sequencing samples.
- Clinical and outbreak genomics workflows requiring high-accuracy nanopore-based pathogen sequencing and variant detection.
- Integrating as a post-assembly polishing step in end-to-end nanopore bioinformatics pipelines alongside tools like Guppy, Minimap2, and Flye.
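As a sketch of the polishing step in such a pipeline, the Python snippet below assembles a `medaka_consensus` command line. The `-i` (basecalled reads), `-d` (draft assembly), `-o` (output directory), `-t` (threads), and `-m` (model) options follow Medaka's documented usage. The file names and the helper function are hypothetical, and the right model name depends on your chemistry and basecaller, so it is left unset here.

```python
import shlex


def medaka_consensus_cmd(reads, draft, outdir, threads=2, model=None):
    """Build the argument list for a medaka_consensus polishing run."""
    cmd = [
        "medaka_consensus",
        "-i", reads,         # basecalled reads (FASTA/FASTQ)
        "-d", draft,         # draft assembly to polish, e.g. from Flye
        "-o", outdir,        # output directory for the polished consensus
        "-t", str(threads),  # number of threads
    ]
    if model is not None:
        # Assumption: the caller supplies a model name valid for their data.
        cmd += ["-m", model]
    return cmd


# Hypothetical file names, for illustration only.
cmd = medaka_consensus_cmd("reads.fastq", "draft.fasta", "medaka_out", threads=4)
print(shlex.join(cmd))
```

On a machine with Medaka installed and real input files, passing the list to `subprocess.run(cmd, check=True)` would execute the run. Building the command as a list rather than a single string avoids shell-quoting problems with unusual file paths.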
Pros
- High Accuracy: Reported by its developers to deliver more accurate consensus and variant calling results than traditional graph-based and signal-based methods.
- Fast Performance: Reported to be faster than the graph-based and signal-based methods it is compared against, with optional GPU acceleration making it suitable for large-scale genomics projects.
- Open Source & Free: Freely available under an open-source license from Oxford Nanopore Technologies, with active development and community support.
- Pipeline Integration: Easily integrates into standard nanopore bioinformatics workflows as a polishing step after basecalling and assembly.
Cons
- Nanopore-Specific: Designed exclusively for Oxford Nanopore sequencing data; not applicable to Illumina or PacBio sequencing platforms.
- Requires Bioinformatics Expertise: Command-line-only tool with no GUI, requiring familiarity with bioinformatics pipelines and Linux/macOS environments.
- GPU Dependency for Speed: Optimal performance requires GPU hardware; CPU-only runs can be significantly slower on large datasets.
Frequently Asked Questions
Which sequencing platforms does Medaka support?
Medaka is designed specifically for Oxford Nanopore Technologies (ONT) sequencing data. It is not compatible with short-read platforms like Illumina.
How does Medaka differ from polishers such as Racon?
Medaka uses neural networks trained on nanopore data to model error patterns more accurately than sequence-graph methods like Racon, which its developers report gives higher consensus accuracy while also being faster.
Can Medaka do both consensus polishing and variant calling?
Yes. Medaka supports both consensus sequence polishing (correcting draft assemblies) and variant calling (haploid and diploid) against a reference sequence.
Does Medaka require a GPU?
A GPU is not strictly required but is strongly recommended for production use. GPU acceleration significantly speeds up the neural network inference step, especially for large genomes.
How is Medaka installed?
Medaka can be installed as a Python package via pip or conda, and is also available as a Docker/Singularity container for HPC environments.
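After installing (for example with `pip install medaka` inside a virtual environment, or from the bioconda `medaka` package), a quick check like the following can confirm that the package and its command-line entry points are visible. This is a minimal sketch: the helper name is hypothetical, and it only tests for presence, not for a working GPU setup or a particular version.

```python
import importlib.util
import shutil


def medaka_install_status():
    """Report whether the medaka package and its CLI tools can be found."""
    return {
        "python_package": importlib.util.find_spec("medaka") is not None,
        "medaka_cli": shutil.which("medaka") is not None,
        "medaka_consensus_cli": shutil.which("medaka_consensus") is not None,
    }


status = medaka_install_status()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If the Python package is found but the command-line tools are missing, the environment's `bin` directory is probably not on `PATH`, which is a common cause of pipeline failures on HPC systems.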
