Glow

Open Source

Glow is an open-source toolkit built on Apache Spark for biobank-scale genomic data processing, statistical analysis, and machine learning. It reads VCF and BGEN files and can be used from Python, SQL, R, Java, or Scala.

Data & Analytics

AI Frameworks

AI Research Tools

About

Glow is an open-source genomics toolkit for processing biobank-scale genomic data on Apache Spark, a widely used engine for big data processing and machine learning. It is built for researchers, bioinformaticians, and data engineers who work with large genetic datasets. Rather than introducing a separate programming model, Glow extends Spark SQL with genomics-specific functions and data sources, so users who already know Spark have little new API to learn.

The toolkit loads common genomic file formats, including VCF and BGEN, directly into Spark DataFrames. It provides built-in functions for quality control, data manipulation, variant normalization, and liftover. Regression functions and integration with Spark ML libraries support population stratification and other statistical analyses.

Queries can be written in Python, SQL, R, Java, or Scala, and genomic datasets can be combined with electronic health records, real-world evidence, and medical imaging data. Glow also includes utilities for piping DataFrames through existing command-line tools and Pandas functions, which makes it easier to parallelize legacy pipelines.

Glow is designed for cloud-scale deployments and suits pharmaceutical research organizations, academic genomics centers, and biotech startups that run population-scale genomics workflows. Because it is open source and has an active community, it can serve as an extensible foundation for tertiary genomics analysis pipelines.
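As a rough illustration of the loading and quality-control steps described above, the sketch below wraps a few Glow calls in small PySpark helpers. It assumes a Spark session with the Glow package available. The helper names (`register_glow`, `load_variants`, `with_call_stats`) and the file path are hypothetical, and the Glow function names (`glow.register`, `call_summary_stats`, `expand_struct`) should be checked against the version you have installed.

```python
# Sketch only: needs a Spark session that has the Glow package available.
# Helper names and paths are illustrative, not part of Glow itself.

SUPPORTED_FORMATS = ("vcf", "bgen")


def register_glow(spark):
    """Attach Glow's SQL functions and data sources to an existing SparkSession."""
    import glow  # imported lazily so this module can be imported before Glow is installed
    return glow.register(spark)


def load_variants(spark, path, fmt="vcf"):
    """Read a genomic file into a DataFrame through the named Spark data source."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return spark.read.format(fmt).load(path)


def with_call_stats(df):
    """Append per-variant call summary statistics as top-level columns."""
    return df.selectExpr("*", "expand_struct(call_summary_stats(genotypes))")


# Typical use (the path is a placeholder):
#   spark = register_glow(spark)
#   variants = load_variants(spark, "/data/cohort.vcf.gz")
#   with_call_stats(variants).show()
```

Keeping the reader behind a small function makes it straightforward to swap `"vcf"` for `"bgen"` without touching downstream code.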

Key Features

Apache Spark Native Integration: Built directly on Apache Spark, enabling petabyte-scale genomic data processing through the familiar Spark SQL APIs, with little additional learning for teams that already use Spark.
Multi-Format Genomic Data Support: Natively reads VCF and BGEN files into Spark DataFrames alongside standard big data formats for seamless interoperability.
Quality Control & Variant Normalization: Includes built-in functions for data QC, manipulation, variant normalization, and liftover to streamline pre-analysis data preparation.
Statistical & ML Analysis: Provides regression functions and Spark ML library integration for population stratification, GWAS, and large-scale machine learning workflows.
Multi-Language API Support: Write genomics queries and pipelines in Python, SQL, R, Java, or Scala, enabling teams to work in their preferred language.
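To make the quality-control feature more concrete, here is a plain-Python sketch of the kind of per-variant statistics a QC step typically computes: call rate and alternate-allele frequency for a biallelic site. It is not Glow's implementation. The function names, the `-1` missing-allele encoding, and the thresholds are assumptions chosen for illustration.

```python
from typing import Dict, Sequence, Tuple

# A diploid call as two allele indices: 0 = reference, 1 = alternate.
# -1 marks a missing allele (an assumed encoding for this sketch).
Call = Tuple[int, int]


def call_summary(calls: Sequence[Call]) -> Dict[str, float]:
    """Compute call rate and alt-allele frequency for one biallelic variant."""
    n_samples = len(calls)
    called = [c for c in calls if -1 not in c]  # drop calls with any missing allele
    n_alleles = 2 * len(called)
    n_alt = sum(allele for c in called for allele in c)
    return {
        "call_rate": len(called) / n_samples if n_samples else 0.0,
        "alt_allele_frequency": n_alt / n_alleles if n_alleles else 0.0,
    }


def passes_qc(stats: Dict[str, float],
              min_call_rate: float = 0.95,
              min_maf: float = 0.01) -> bool:
    """Keep variants that are well called and not too rare (illustrative thresholds)."""
    af = stats["alt_allele_frequency"]
    minor_allele_frequency = min(af, 1.0 - af)
    return stats["call_rate"] >= min_call_rate and minor_allele_frequency >= min_maf


stats = call_summary([(0, 0), (0, 1), (1, 1), (-1, -1)])
print(stats, passes_qc(stats))
```

In a distributed setting the same per-variant calculation would run independently on each row, which is what makes this kind of QC a good fit for a DataFrame engine.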

Use Cases

Running genome-wide association studies (GWAS) at biobank scale using distributed Spark compute clusters.
Combining genomic variant data with electronic health records for population health research.
Performing quality control, normalization, and liftover on large VCF or BGEN genomic datasets as part of a cloud-based data pipeline.
Applying machine learning models for population stratification and genomic risk scoring using Spark ML integration.
Parallelizing existing Pandas-based or command-line genomics tools across a distributed cluster to accelerate legacy workflows.
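The GWAS use case above comes down to fitting one small regression per variant. The sketch below shows that idea in plain Python, regressing a quantitative phenotype on alternate-allele dosage with ordinary least squares. It is a simplified illustration rather than Glow's regression functions: it omits covariates and test statistics, and the names `dosage_effect` and `scan` are hypothetical.

```python
from statistics import mean
from typing import Dict, Optional, Sequence, Tuple


def dosage_effect(dosages: Sequence[float],
                  phenotype: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Return (slope, intercept) of phenotype regressed on dosage, or None if undefined."""
    if len(dosages) != len(phenotype):
        raise ValueError("dosages and phenotype must have the same length")
    if not dosages:
        return None
    x_bar, y_bar = mean(dosages), mean(phenotype)
    sxx = sum((x - x_bar) ** 2 for x in dosages)
    if sxx == 0:
        return None  # monomorphic variant: no variation to regress on
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosages, phenotype))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def scan(variants: Dict[str, Sequence[float]],
         phenotype: Sequence[float]) -> Dict[str, Optional[Tuple[float, float]]]:
    """Fit one independent regression per variant, keyed by variant ID."""
    return {vid: dosage_effect(d, phenotype) for vid, d in variants.items()}


results = scan(
    {"var1": [0, 1, 2, 1], "var2": [0, 0, 0, 0]},
    [1.0, 2.0, 3.0, 2.0],
)
print(results)
```

Because each variant is fitted independently, the scan parallelizes naturally across a cluster. That independence is what lets a distributed engine spread the work over many machines.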

Pros

True Biobank-Scale Processing: Leverages Apache Spark's distributed computing to handle genomic datasets at petabyte scale, beyond what single-machine tools can practically process.
No New APIs to Learn: Existing Spark users can immediately use Glow's genomics extensions without adopting an entirely new programming model.
Broad Ecosystem Integration: Seamlessly combines genomic data with EHR, real-world evidence, and medical imaging datasets, enabling multi-modal health research.
Open Source & Community-Driven: Freely available under the Apache 2.0 license, with an active community forum and Slack channel for support and contributions.

Cons

Requires Apache Spark Infrastructure: Users must set up and manage a Spark environment (cloud or on-premise), which adds operational complexity for smaller teams.
Steep Learning Curve for Non-Spark Users: Researchers unfamiliar with distributed computing may need to learn Spark's programming model before they can benefit from Glow.
No Graphical User Interface: Glow is entirely code-driven with no GUI, making it less accessible for non-technical biologists or clinical researchers.

Frequently Asked Questions

What is Glow?

Glow is an open-source toolkit built on Apache Spark that enables large-scale genomic data processing, statistical analysis, and machine learning at biobank scale and beyond.

Which file formats does Glow support?

Glow natively supports the VCF and BGEN genomic file formats, as well as common big data formats used in the Apache Spark ecosystem.

Which programming languages can I use with Glow?

Glow supports Python, SQL, R, Java, and Scala via the native Spark SQL APIs, so you can use whichever language fits your workflow.

Do I need to learn a new API to use Glow?

No. If you're already familiar with Apache Spark, Glow extends Spark SQL with genomics-specific functions and data sources, so you can get started immediately.

Can I combine genomic data with other types of data?

Yes. Glow allows you to combine genomic data with electronic health records, real-world evidence, medical images, and other datasets using the same Spark APIs.