MarginPolish

MarginPolish is an open-source, graph-based assembly polisher for Oxford Nanopore sequencing data. It improves assembly accuracy by finding multiple probable alignment paths for run-length-encoded reads and using them to refine the consensus sequence.

About

MarginPolish is a graph-based genome assembly polisher developed by the UCSC Nanopore Computational Genomics Lab and released under the MIT license. Designed specifically for Oxford Nanopore Technologies (ONT) long-read sequencing data, it takes a FASTA assembly and an indexed BAM file of aligned nanopore reads as input and produces a polished, higher-accuracy FASTA assembly as output.

The tool works by iteratively finding multiple probable alignment paths for run-length-encoded reads and using them to generate a refined consensus sequence. This probabilistic approach lets MarginPolish resolve ambiguities common in long-read alignments, particularly the homopolymer errors characteristic of ONT data, so that the consensus better represents the true genomic sequence.

MarginPolish is both a capable standalone polisher and a key component of a complete genome assembly pipeline alongside Shasta (an ultrafast nanopore assembler) and HELEN (a multi-task RNN polisher). It generates feature images that HELEN uses for a second, deep-learning-based polishing pass, achieving near-reference-quality assemblies from raw nanopore reads.

Built in C with CMake, MarginPolish supports Docker-based deployment for reproducible research environments. It is best suited for bioinformaticians and computational biologists working on de novo genome assembly projects, variant calling pipelines, and nanopore-based sequencing studies in academic or enterprise research settings.
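Because the tool expects a FASTA assembly plus an indexed BAM, it can be worth checking that pair before starting a long polishing run. The following is a minimal pre-flight sketch in Python, not part of MarginPolish itself. The function names and the example file names are hypothetical, and it assumes the index follows the usual samtools naming (reads.bam.bai or reads.bai next to the BAM).

```python
from pathlib import Path


def find_bam_index(bam_path):
    """Return the path of a BAI index next to the BAM, or None if absent."""
    bam = Path(bam_path)
    # Assumption: samtools-style naming, <name>.bam.bai or <name>.bai.
    candidates = [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def check_polisher_inputs(fasta_path, bam_path):
    """Collect readable problems with the input pair (empty list means OK)."""
    problems = []
    if not Path(fasta_path).is_file():
        problems.append(f"assembly FASTA not found: {fasta_path}")
    if not Path(bam_path).is_file():
        problems.append(f"alignment BAM not found: {bam_path}")
    elif find_bam_index(bam_path) is None:
        problems.append(f"no .bai index found next to: {bam_path}")
    return problems


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    for problem in check_polisher_inputs("assembly.fa", "reads.bam"):
        print("input problem:", problem)
```

The check only looks at file presence. It does not verify that the reads in the BAM were actually aligned to the given assembly, which still has to be ensured when the alignment is produced.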

Key Features

Graph-Based Probabilistic Polishing: Iteratively finds multiple probable alignment paths through the assembly graph to resolve ambiguities and generate a refined consensus sequence.
Run-Length Encoding Support: Natively handles run-length-encoded nanopore reads to effectively correct the homopolymer errors characteristic of ONT sequencing.
HELEN Feature Image Generation: Produces polishing feature images compatible with the HELEN multi-task RNN polisher for a two-stage deep-learning-enhanced polishing pipeline.
Standalone or Pipeline Mode: Functions as a standalone assembly polisher or as part of the full Shasta + MarginPolish + HELEN genome assembly pipeline.
Docker Support: Includes Docker configurations for reproducible, environment-independent deployment in research and production genomics workflows.
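To illustrate why run-length encoding helps with homopolymer errors, here is a small Python sketch of the general technique. It is not MarginPolish's own implementation (which is written in C), and the function names are chosen for illustration only. Each run of identical bases is collapsed to one base plus a run length, so two sequences that differ only in the length of a homopolymer share the same collapsed base string.

```python
from itertools import groupby


def run_length_encode(seq):
    """Collapse each run of identical bases into (base string, run lengths)."""
    bases = []
    lengths = []
    for base, run in groupby(seq):
        bases.append(base)
        lengths.append(sum(1 for _ in run))
    return "".join(bases), lengths


def run_length_decode(bases, lengths):
    """Expand a collapsed base string back to the full sequence."""
    return "".join(base * n for base, n in zip(bases, lengths))


if __name__ == "__main__":
    reference = "GATTTTTACA"  # five T's
    read = "GATTTTACA"        # same sequence with one T missing
    ref_bases, ref_lengths = run_length_encode(reference)
    read_bases, read_lengths = run_length_encode(read)
    print(ref_bases, ref_lengths)    # GATACA [1, 1, 5, 1, 1, 1]
    print(read_bases, read_lengths)  # GATACA [1, 1, 4, 1, 1, 1]
```

In run-length space the two sequences align base for base, and the homopolymer error shows up only as a disagreement in one run length (5 versus 4) rather than as an insertion or deletion in the alignment.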

Use Cases

Polishing de novo genome assemblies generated from Oxford Nanopore Technologies long-read sequencing experiments
Running as the central component in the Shasta + MarginPolish + HELEN end-to-end genome assembly pipeline
Improving consensus accuracy in nanopore-based bacterial, fungal, or eukaryotic genome assembly projects
Generating polishing feature images for downstream deep-learning-based refinement with the HELEN RNN polisher
Enabling reproducible genome assembly research through Docker-based deployment in shared computing environments

Pros

Open Source & Freely Licensed: Released under the MIT license, making it accessible for both academic research and commercial genomics applications at no cost.
Purpose-Built for Nanopore Data: Specifically engineered to handle the characteristics of ONT long reads, including systematic homopolymer errors, unlike general-purpose polishers.
Seamless Pipeline Integration: Designed to work natively with Shasta and HELEN, enabling a complete, state-of-the-art end-to-end genome assembly workflow.

Cons

Nanopore-Only Compatibility: Designed exclusively for Oxford Nanopore Technologies reads and not intended for Illumina short reads or PacBio long reads.
Steep Learning Curve: Requires significant bioinformatics expertise and command-line proficiency to install, configure, and interpret results correctly.
No GUI, Limited Documentation: As a research-grade command-line tool, it has no graphical interface and may have sparser documentation than commercial alternatives.

Frequently Asked Questions

What input does MarginPolish require?
MarginPolish requires a FASTA assembly file and an indexed BAM file containing Oxford Nanopore Technologies reads aligned to that assembly.

What does MarginPolish output?
It outputs a polished FASTA assembly with improved consensus accuracy, and optionally generates feature images for downstream polishing with the HELEN RNN polisher.

Does MarginPolish work with Illumina or PacBio data?
No. MarginPolish is specifically designed for Oxford Nanopore Technologies long reads and is not intended for short-read sequencing or PacBio data.

Is MarginPolish free to use, including commercially?
Yes. MarginPolish is released under the MIT license, which allows free use for both academic research and commercial applications.

How do Shasta, MarginPolish, and HELEN work together?
Shasta produces the initial de novo assembly from raw nanopore reads. MarginPolish refines that assembly using graph-based alignment and generates feature images. HELEN then uses those images for a second round of deep-learning-based polishing to achieve near-reference-quality results.