Support for Spark SQL and MLlib expands capabilities of the Hadoop platform for developers and data scientists
PALO ALTO, Calif. – November 30, 2015 – Cloudera, provider of the fastest, easiest, and most secure data management and analytics platform built on Apache Hadoop and the latest open source technologies, announced today that it has further matured Apache Spark integration within Apache Hadoop environments, with critical achievements around usability and interoperability throughout the past year. To further expand the enterprise capabilities of this powerful data processing engine, Cloudera has added support for Spark SQL and MLlib into Cloudera Enterprise 5.5 and CDH 5.5, which the company launched recently.
Thanks to its ease of development and flexible data processing, Spark has soared in popularity within the open source community and across customer use cases. It is the most active project in the Apache Software Foundation (ASF), with more than 800 developers from more than 200 companies. Cloudera’s team of Spark committers has been actively driving the enterprise capabilities of Spark and uniting Spark within Hadoop to meet customer needs and further production adoption (see infographic).
“The embrace of Spark by the developer community and Cloudera’s efforts in the past year to drive its mainstream adoption have been nothing short of remarkable,” said Doug Cutting, chief architect at Cloudera. “With the most customers running Spark with Hadoop, we have already made impressive strides in furthering the enterprise capabilities of Spark for Hadoop deployments across industries and use cases. With the addition of Spark SQL and MLlib to Cloudera’s platform, and a clear roadmap with the One Platform Initiative, Spark adoption will continue to soar for batch, streaming, and machine learning use cases.”
Cloudera and Spark: A Year in Review for Production Adoption
Over the past year, Cloudera has made significant strides in maturing Spark to address a wider range of data processing use cases, including end-to-end Internet of Things (IoT) applications, simpler batch processing, and native machine learning.
As more customers aimed to take advantage of Internet of Things and real-time streaming data, they needed an enterprise-grade stream processing engine to support their applications. To address this, Cloudera led development on Spark Streaming resiliency, ensuring zero data loss and bringing it up to production standards. This critical improvement, paired with the integration of Apache Kafka within the platform, has allowed Cloudera customers to build complete IoT applications within a unified platform and has had a drastic impact on Spark Streaming adoption overall.
To enable simpler, more powerful batch processing, and to help solidify Spark’s place as the standard execution engine in Hadoop, Cloudera also released the beta of Apache Hive-on-Spark this year. Because Hive is the tool of choice for ETL development, its integration with the Spark processing engine marks a significant milestone in supporting next-generation data integration workloads and the adoption of Spark as the successor to MapReduce.
Cloudera’s One Platform Initiative, announced in September, continues the acceleration of Apache Spark development for the enterprise and within the Hadoop ecosystem. Cloudera is making significant gains in enhancing Spark’s security, scale, management, and streaming capabilities, and will continue to focus heavily on this development in the coming year.
With the recent Cloudera 5.5 release, Cloudera has added Spark MLlib, which extends Spark’s ease of use and performance gains to machine learning applications within Hadoop, and Spark SQL, which lets developers and data scientists embed SQL seamlessly within Spark applications. This release also includes improvements made to Spark SQL’s query engine as part of Project Tungsten, yielding significant gains in efficiency and speed. In addition, integrations with Hive and its metastore ensure full interoperability of data schemas with Spark SQL within the Hadoop platform, so that users have a seamless experience with the right tool for the job, whether that is ETL development with Hive, application development with Spark SQL, or interactive business intelligence with Impala.
Driving Broad Customer Adoption
With the most experience supporting Spark as part of Hadoop, Cloudera has more customers running Spark on Hadoop than all other vendors combined and powers some of the largest multi-tenant Spark clusters today, including deployments over 800 nodes.
With over 170 customers running Spark across a vast range of industries, including finance, healthcare, retail, and insurance, Cloudera has helped customers embrace a wide range of next-generation use cases, including:
Cox Automotive: Leading provider of products and services for automotive dealers and car buyers, moved from hourly analytics to real-time insights into ad campaigns using Spark Streaming
PRGX: World's leading provider of accounts payable recovery audit services, stated Spark’s flexible, performant data processing has been a “saving grace” and resulted in a 9-10x performance improvement compared to legacy systems
Allstate: One of the nation’s largest insurance providers, uses Cloudera and Apache Spark to combine more than 80 years of data for highly refined pricing models
RelayHealth: Healthcare technology solution provider and subsidiary of McKesson, builds predictive models for when payments to healthcare providers will be received, improving their cash flow. The company processes healthcare payment interactions between 200,000 physicians, 2,000 hospitals, and 1,900 health plan subscribers
Barclays: Multinational banking and financial services company, builds an insights engine that securely analyzes previously disparate transaction data and delivers relevant insights to Barclays customers in an easily digestible manner
In addition, Cloudera’s Accelerator Program for Spark has driven dozens of robust Spark applications and integrations with the leading third-party tools, further expanding the capabilities of Spark to customers. Key partners include Datameer, Informatica, Oracle, Paxata, Pentaho, Platfora, StreamSets, Syncsort, Talend and Trifacta.
"Datameer is excited to see Cloudera’s continued investment in Spark as it has the potential to provide huge value to our customers thanks to its scalability and interactive performance," said Stefan Groschupf, CEO of Datameer. "Beyond the Spark Connector that we are announcing here at Strata + Hadoop World Singapore, we will also continue to work closely with Cloudera to develop those high-value use cases around Spark as well as other components of the Hadoop platform."
“The opportunity for Informatica and Cloudera to work together to further the development and deployment of Apache Spark alongside Hadoop is great for our joint customers,” said Sanjay Krishnamurthi, senior vice president and chief technology officer, Informatica. “These customers are leveraging Spark inside Informatica’s Big Data Management platform to deliver trusted analytics at scale. Together with Cloudera, we are providing high-speed discovery of data assets for holistic big data governance and security and for simpler big data integration, which ensures trust in the face of ever-growing data volumes.”
“Oracle delivers Spark-based, enterprise-grade products on-premises and in the cloud, including Oracle Big Data Discovery, Oracle R Advanced Analytics for Hadoop, Oracle Data Integrator, Oracle Big Data Appliance and Oracle Big Data Cloud Service,” said Neil Mendelson, vice president, Big Data & Advanced Analytics, Oracle. “The use of Spark in-memory processing has led to performance gains across our entire Big Data portfolio. We look forward to innovating with Cloudera and Intel to provide increased performance to our customers by leveraging Spark in our current and future Big Data products.”
"Paxata provides information-driven organizations with the most comprehensive platform designed for interactive self-service data preparation at massive scale," said Prakash Nanduri, CEO and Co-founder of Paxata. "As one of the leading vendors to fully leverage Spark, we are able to deliver a customer experience that is unhindered by data volumes, variety, or velocity. We are delighted to see Cloudera’s efforts and investments in development on Spark, as promised through the One Platform Initiative. Our participation in the Cloudera ecosystem and our decision to align with Cloudera's vision has been instrumental in our leadership in the market and our development success with Spark."
“Pentaho is focusing on future-proofing customers’ big data investments, and collaborating early with key partners on new and promising technologies, such as Spark, is one way we help fulfill on this promise,” said Will Gorman, VP of Labs at Pentaho, a Hitachi Group Company. “With the engineering resources available through the Cloudera Accelerator Program for Spark, Pentaho Labs, the innovation center here at Pentaho, can test-drive and collaborate on the new features in the world of enterprise big data analytics with Spark. By prototyping to real-world use cases, our customers can embrace native integrations to build next-gen architectures.”
"Platfora has made a major investment in Spark to make our platform Spark native," said Jason Zintak, CEO and President of Platfora. "Our customers want the speed of development and processing that Spark provides but they need the security and reliability that comes with a mature Hadoop platform. With Cloudera’s One Platform Initiative, Cloudera is taking a leadership position toward scalability, management and security of Spark, absolutely essential for Spark to cross the chasm. Platfora has been a founding member of the Cloudera Accelerator Program and is fully certified with Cloudera on Spark."
"Cloudera has once again led the way, evolving Apache Spark from an academic project to an enterprise-grade compute engine," said Arvind Prabhakar, Chief Technology Officer at StreamSets. "The release of Cloudera 5.5 is a key milestone and brings much anticipated features to the broader Hadoop community. StreamSets is committed to Spark as a key stream processing platform and is thrilled to see the advances in Spark Streaming thanks in part to the Cloudera Accelerator Program."
“We see widespread interest among our customers in using Spark to enable next-generation analytics, and expect enterprise adoption for Spark to accelerate quickly as innovations around security, scalability, and management come to market, driven by Cloudera’s One Platform Initiative and latest release of Cloudera Enterprise 5.5. As an Apache Spark community contributor and member of the Cloudera Accelerator Program for Spark, we look forward to continuing to work with Cloudera to help more organizations realize Spark’s full potential by making it easy to deploy with our ‘design once, deploy anywhere’ data integration platform, and to enable streaming workloads like internet of things (IoT) through integration with Spark and Kafka,” said Tendü Yoğurtçu, General Manager of Syncsort’s Big Data business.
“Similar to Cloudera, Talend made an early and deep commitment to Apache Spark. We view participation in the Cloudera Accelerator Program as a key venture that will help further translate the raw power of this vital open source project into an enterprise-grade capability that can be rapidly deployed. Teaming with Cloudera, we are already seeing success in the field; now it’s about helping more companies recognize the business benefits of real-time data integration and analytics.” - Ashley Stirrup, CMO, Talend, Inc.
“Integration with Spark is critical to our mission of empowering analysts to intuitively explore and transform diverse data in Cloudera and a focus of our continuing integration with Cloudera’s technology,” said Wei Zheng, VP of Products, Trifacta. “With Spark powering at-scale profiling visualizations and transformation executions in Trifacta Wrangler Enterprise, we are able to provide our customers with a more fluid user experience leveraging the interactive performance of Spark.”
Cloudera delivers the modern data management and analytics platform built on Apache Hadoop and the latest open source technologies. The world’s leading organizations trust Cloudera to help solve their most challenging business problems with Cloudera Enterprise, the fastest, easiest and most secure data platform available for the modern world. Our customers efficiently capture, store, process and analyze vast amounts of data, empowering them to use advanced analytics to drive business decisions quickly, flexibly and at lower cost than has been possible before. To ensure our customers are successful, we offer comprehensive support, training and professional services. Learn more at http://cloudera.com.
Connect with Cloudera
Read our blog: blog.cloudera.com
Follow us on Twitter: twitter.com/cloudera
Visit us on Facebook: facebook.com/cloudera
Join the Cloudera Community: cloudera.com/community
Cloudera, Cloudera's Platform for Big Data, Cloudera Enterprise Data Hub Edition, Cloudera Enterprise Flex Edition, Cloudera Enterprise Basic Edition, Cloudera Navigator Optimizer and CDH are trademarks or registered trademarks of Cloudera Inc. in the United States, and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.