Nearly 1,600 institutions from more than 130 countries worldwide share data through GBIF.org, making it one of the world’s largest sources of information about where and when all organisms—plants, animals, fungi, and microbes—have been observed or collected over the past four centuries.
To fulfill its mission of giving anyone, anywhere in the world free and open access to biodiversity data via the Internet, and to support large-scale knowledge generation and data analysis, the GBIF Secretariat in Copenhagen relies on the flexibility of Cloudera's tools and systems to deploy its data lake.
Biological researchers around the world today fundamentally rely on data—their own and others’—to carry out their research. But without a collective framework for storing and sharing it, much of the data about life on earth would be used once and then sit discarded and forgotten on disconnected computers without contributing to wider knowledge.
GBIF’s global infrastructure required a platform that would make it easier for scientists, researchers, and institutions both to share and to access data. Early versions of the infrastructure were built around MySQL, which quickly proved inefficient at handling large volumes of data. The facility needed a platform that could keep pace with demand for near-real-time data collection and classification. Introduced initially to offload heavy processing from MySQL, the Hadoop ecosystem now provides the core platform on which GBIF integrates, processes, indexes and analyses all incoming data.
GBIF’s 100 formal members and 1,600 data publishers form a distributed network of both experts and infrastructure. When a data publisher creates or updates a dataset in a GBIF-connected repository, the crawling infrastructure brings those changes into the data lake. Newly arrived data passes through a series of formatting, quality-control and enrichment steps and is then made available to data analysts through Hive and to the public through search indexes. With MySQL, the team had to halt the crawlers repeatedly because the sheer volume of incoming data was overwhelming the database.
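The format-check-enrich flow described above can be sketched as a simple pipeline. This is an illustrative sketch only, not GBIF's actual code: the record fields, the validation rules and the helper names (`normalize`, `quality_check`, `enrich`, `ingest`) are assumptions chosen to show the shape of such a pipeline, in which each crawled record is cleaned, flagged with quality issues, enriched with reference data and appended to an index.

```python
# Hypothetical sketch of a crawl-to-index pipeline: each newly arrived
# record is normalized, quality-checked, enriched, then indexed.
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Record:
    scientific_name: str
    country_code: str
    year: Optional[int] = None
    issues: List[str] = field(default_factory=list)
    country_name: Optional[str] = None

def normalize(rec: Record) -> Record:
    # Formatting step: trim whitespace, standardise casing.
    rec.scientific_name = rec.scientific_name.strip()
    rec.country_code = rec.country_code.strip().upper()
    return rec

def quality_check(rec: Record) -> Record:
    # Quality-control step: flag suspicious values rather than drop them.
    if rec.year is not None and not (1600 <= rec.year <= 2025):
        rec.issues.append("YEAR_OUT_OF_RANGE")
    if len(rec.country_code) != 2:
        rec.issues.append("BAD_COUNTRY_CODE")
    return rec

def enrich(rec: Record, country_names: dict) -> Record:
    # Enrichment step: attach reference data (here, a country name).
    rec.country_name = country_names.get(rec.country_code)
    return rec

def ingest(raw_records, index, country_names):
    # Run each record through the pipeline and append it to the index.
    for rec in raw_records:
        index.append(enrich(quality_check(normalize(rec)), country_names))
    return index
```

In a production system each stage would run as a distributed job over the data lake rather than a Python loop, but the ordering of the stages is the point of the sketch: records are never rejected outright, only annotated with issues so that downstream consumers can decide how to filter.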
With its small informatics team and limited resources, the GBIF Secretariat turned to Cloudera to set the foundation of a modern data architecture. “Switching from MySQL to Hadoop provided us the scalability and flexibility we needed to process, analyse and distribute the volumes of data we see,” said Tim Robertson, who leads the GBIF informatics team. “Perhaps most importantly, though, Cloudera enables us to maximize the resources of a small team. The Cloudera distribution provides us compatible versions of the products (HBase, Solr, Hive etc), easy deployment and updating along with monitoring, alerting and diagnostic tools. In fact, we were able to save the effort of about one full-time employee, or 20% of our capacity. This means we are able to focus our effort on building software specific to biodiversity data and spend far less time on the internal plumbing of the platform.”
After implementing the enterprise data platform, GBIF was able to sustain its data volumes, regularly updating indexes at 10,000 records per second.
Through Cloudera, GBIF maximized the resources of its small team. In doing so, it enabled broader data sharing and access and significantly improved operational efficiency, freeing up time to focus on other challenges. The open-source platform is critical to balancing the team’s resources for developing, managing and supporting GBIF’s open data repository. Without the community and the ability to inspect the source code, the team would have been unable to achieve results at this scale and speed.
GBIF achieves real-time data analysis, recovery and index updates, all managed within the enterprise data platform. Whether scientists want to know how climate change or invasive species will affect patterns of life on earth, or how they will alter the benefits we derive from natural systems, GBIF provides a framework that makes the latest data readily available and easily accessible.