Community Effort Driving Standardization of Apache Spark Through Expanded Role in Hadoop Projects
Cloudera, Databricks, IBM, Intel, and MapR Collaborate to Foster Best-In-Class Open Source Standards
Palo Alto, Calif. – July 1, 2014 – Open source contributors Cloudera, Databricks, IBM, Intel, and MapR announced today that they are joining efforts to broaden support for Apache Spark (Spark), while simultaneously standardizing it as the framework of choice by bringing popular tools from the MapReduce world to this new engine.
Spark has quickly become a standard in many Hadoop distributions, with rapid customer adoption across a variety of use cases, ranging from machine learning to stream processing workloads. To further support this growth, these five vendors have come together to collectively broaden the range of tools and technologies in the Hadoop ecosystem that leverage Spark as an underlying processing engine.
Today, besides being used independently as a general-purpose processing engine, Spark is used as the basis for several projects including:
1. Spark Streaming for continuous data processing
2. MLlib for machine learning
3. GraphX for graph analytics capabilities
In recent months, other projects have also added support for Spark, as evinced by recent efforts to port Crunch, Mahout, and Concurrent’s Cascading framework to Spark.
This collaborative new effort expands upon the Spark momentum to include several key Hadoop projects - starting with the Apache Hive SQL engine (Hive). Using Spark as the underlying execution engine, this effort will improve the performance of batch SQL jobs in Hive, while seamlessly maintaining compatibility with the core Hive code base.
Simultaneously, the group is investigating ways to adapt Apache Pig to leverage Spark, as well as other popular tools, such as Sqoop and Search. By making Spark the execution layer of choice, this group is driving consolidation and standardization around Spark as the evolution of MapReduce for modern hardware.
This effort highlights the power of open source communities, with marketplace competitors coming together to help shape a common execution layer, thus creating a community standard. End users benefit by having a widely supported execution layer, preventing lock-in, while continuing to use their tools of choice. Further, the simplicity of having to manage and learn a single engine reduces operational costs.
Spark is an open source data analytics framework originally developed in the AMPLab at UC Berkeley. Quickly embraced for its inherent advantages, such as improved data processing and in-memory capabilities on Hadoop, Spark can run certain applications up to 100 times faster than Hadoop MapReduce. Spark has attracted the attention of the open source community and vendors alike.
Hive is a data warehouse infrastructure initially developed by Facebook Inc. and built on top of Hadoop. Hive was created to query and manage large datasets stored across a cluster of servers. Hive remains a popular choice for SQL batch processing and offers many advantages to customers. There is an active community, including enterprise vendors Cloudera, IBM, Intel, and MapR, committed to furthering Hive based on cutting-edge industry standards.
“The ecosystem of software related to Apache Hadoop is constantly evolving and expanding to include new ways to process Big Data. The rapid changes and additions signify a need among our customers for standardization on platform functionality to simplify the usage of Hadoop for business applications,” said Doug Cutting, co-creator of Hadoop and chief architect for Cloudera. “By creating an industry standard with Apache Spark, we help our customers realize a faster time to value from their deployments.”
“Since Spark was open sourced, it has generated rapid interest, with over 200 contributors from 50+ organizations collaborating around the project, establishing itself as the most active project in the Hadoop ecosystem,” said Arsalan Tavakoli-Shiraji, Business Development Lead at Databricks. “With such a groundswell of support and clear benefits for realizing sophisticated analytics, we believe Spark is the future of data processing on Hadoop.”
“Big Data is an inflection point for next generation businesses. IBM is committed to Spark for data processing in Hadoop and to open source technologies across the Hadoop ecosystem. Big data is providing innumerable benefits to our customers as we focus on making Hadoop withstand the demands of the enterprise, adding administrative, discovery, development, provisioning, security and governance, along with best-in-class analytical capabilities,” said Anjul Bhambhri, Vice President, Big Data.
“We are committed to accelerating the evolution of Hadoop through our strategic alliance with Cloudera and Intel contributions to key open source projects in the ecosystem,” said Vin Sharma, director, Strategy & Business Development for Intel’s Big Data Solutions. “Spark is delivering one of the most exciting innovations in data processing into the Hadoop platform. By working with the developer communities to establish Spark as the back-end for Hive, Pig, and other Hadoop projects, Intel aims to bring the latest advances in memory and processor technologies into widespread use by enterprises grappling with hyperscale analytics.”
“We believe Spark on Hadoop is a game changer for any business and moves Hadoop closer to supporting real-time, operational applications,” said M. C. Srivas, CTO and Co-founder of MapR Technologies. “MapR is dedicated to collaborating, contributing to and supporting these open source projects and customers will benefit immensely from the Hive and Spark communities coming together to significantly enhance Hadoop and move it forward.”
Cloudera, Databricks, IBM, Intel, and MapR agree that by bringing Spark more widely to Hadoop communities, the outcome will be a rich and unified ecosystem that will deliver the next level of performance in Hadoop deployments. A proposal for Apache Hive on Apache Spark has been submitted, and the companies anticipate beginning work immediately upon its approval. Hive on Spark will work within the context of the existing Hive community and establish Spark as the back-end standard to improve Hive performance.
Cloudera is revolutionizing enterprise data management by offering the first unified Platform for big data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera's open source big data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of Hadoop professionals, Cloudera has trained over 40,000 individuals worldwide. Over 1,700 partners and a seasoned professional services team help deliver greater time to value. Leading organizations in every industry plus top public sector organizations globally run Cloudera in production.
Connect With Cloudera
Follow us on Twitter: http://twitter.com/cloudera
Visit us on Facebook: http://www.facebook.com/cloudera
Join the Cloudera Community: http://cloudera.com/community
Cloudera, Cloudera's Platform for Big Data, Cloudera Enterprise Data Hub Edition, Cloudera Enterprise Flex Edition, Cloudera Enterprise Basic Edition and CDH are trademarks or registered trademarks of Cloudera Inc. in the United States, and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.