
Hail on Cloudera Quickstart


Solution overview

MetiStream, in partnership with Cloudera, has developed an implementation offering for Hail (https://hail.is/ and https://github.com/hail-is/hail) built on Cloudera’s CDH big data platform, which incorporates the power of Apache Spark. Hail is an open-source genomic processing tool designed by a team of researchers from the Broad Institute of MIT and Harvard and the Analytic and Translational Genetics Unit of Massachusetts General Hospital. Featuring a suite of built-in tools for quality control, genomic annotation, and statistical analysis, Hail enables you to quickly glean insights from massive amounts of data. Our solution also lets customers pass their genomic data through to Cloudera’s Spark environment, making available the host of Machine Learning (ML) packages designed for Apache Spark. As a result, this service offering provides customers with a comprehensive, fast, and cost-efficient approach and framework for supporting the downstream whole-genome pipeline process.

As a key feature of this offering, Hail’s suite of genomic quality control measures includes:

Sample QC (Quality control methods applied to each person’s data)

  • Chromosomal Anomalies/Sex Inconsistencies
  • Sample Relatedness
  • Population Substructure
  • Sample Genotyping Efficiency (Call Rate, etc.)
  • Identity by Descent (IBD) Pruning

Variant QC (Quality control methods applied to genomic markers)

  • Hardy-Weinberg equilibrium annotations
  • Flagging Mendelian errors on HapMap control samples
  • Marker Genotyping Efficiency (Call Rate, etc.)
  • Linkage Disequilibrium (LD) Pruning
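
To make two of the variant-level measures above concrete, here is a minimal sketch in plain Python of a marker call rate and a Hardy-Weinberg check. It does not use Hail's API. The function names `call_rate` and `hwe_chi_square` and the genotype encoding (0, 1, 2 for the number of alternate alleles, `None` for a missing call) are assumptions made for illustration, and the chi-square statistic is a simple textbook stand-in that may differ from the test Hail itself applies.

```python
from collections import Counter


def call_rate(genotypes):
    """Fraction of non-missing calls for one marker.

    Assumed encoding (illustrative only): 0, 1, 2 = alternate allele count,
    None = missing call.
    """
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g is not None)
    return called / float(len(genotypes))


def hwe_chi_square(genotypes):
    """Chi-square statistic comparing observed genotype counts with
    Hardy-Weinberg expectations. Returns None when the marker has no
    called genotypes or is monomorphic (an expected count would be zero).
    """
    counts = Counter(g for g in genotypes if g is not None)
    n = sum(counts.values())
    if n == 0:
        return None

    observed = [counts[0], counts[1], counts[2]]
    # Reference allele frequency estimated from the called genotypes.
    p = (2.0 * observed[0] + observed[1]) / (2.0 * n)
    q = 1.0 - p
    expected = [p * p * n, 2.0 * p * q * n, q * q * n]
    if any(e == 0 for e in expected):
        return None

    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For example, a marker with 2 homozygous-reference, 4 heterozygous, and 2 homozygous-alternate calls matches the expected proportions exactly, so the statistic is 0; a marker with no heterozygous calls at all scores far higher and would be a candidate for flagging.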

After quality control, Hail provides tools for easily annotating or filtering your data using its built-in expression language. Annotations can be made from in-house tables or generated from public data sources such as gnomAD. Hail also includes an analytical suite with algorithms such as Principal Components Analysis (PCA) for computing sample scores and linear and mixed-model regressions for estimating heritability. In addition, data in Hail’s native VDS format can be converted for SQL queries using Cloudera’s Impala and Hive tools. If your analytical pipeline includes Machine Learning, our solution allows you to easily port any genomic data in Hail to the Spark environment for more robust analytics.

Required Capabilities

  • General: Spark 2.1+, Python 2.7
  • Cloudera: CDH 5.10+, Cloudera security cert (recommended)

Differentiators

Fast, affordable genomic processing at scale.

Key highlights

Category: Drive Customer Insights

About MetiStream
MetiStream is a Big Data services and solutions company specializing in real-time implementations and advanced analytics. We aim to help our customers realize and fully utilize their data’s potential. Located in the Washington, DC area, we are a women-owned, minority-owned small business founded by industry experts with backgrounds from top technology companies.

Positive Business Outcomes

Hail enables you to quickly glean insights from massive amounts of data. Our solution also lets customers pass their genomic data through to Cloudera’s Spark environment, making available the host of Machine Learning (ML) packages designed for Apache Spark. As a result, this service offering provides customers with a comprehensive, fast, and cost-efficient approach and framework for supporting the downstream whole-genome pipeline process.
