Apache Spark

The open standard for flexible, in-memory data processing for batch, real-time, and advanced analytics

What is Apache Spark?

Apache Spark is an open source, general data processing framework in the Apache Hadoop ecosystem that makes it easy to develop fast, end-to-end Big Data applications combining batch, streaming, and interactive analytics on all your data. Apache Spark is a key component of CDH, Cloudera’s open source platform, with full enterprise support and capabilities available via Cloudera Enterprise.

Spark in Hadoop

Cloudera is committed to adopting Apache Spark as the successor to MapReduce and as the core data execution engine for workloads in the Hadoop ecosystem. To help users make this transition, Cloudera's Apache committers are working to complement MapReduce with Spark in Hadoop ecosystem components, including Apache Crunch on Spark, Apache Solr on Spark, Apache HBase-Spark integration (Cloudera Labs), Hive on Spark (beta), and Apache Pig on Spark (alpha).

Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem

Apache Spark Benefits

Turn data into actionable insights, and quickly iterate for maximum impact

For Developers and Data Scientists:

  • Easy development in Scala, Python, or Java via a rich set of operators and machine learning libraries
  • Increased productivity through a single, expressive API for batch and streaming applications
  • Interactive development with significant performance improvements over MapReduce

Why Apache Spark is a Delight for Developers, Why Apache Spark is a Crossover Hit for Data Scientists

For Architects:

  • Reduced complexity and costs with the standard engine for batch, streaming, and advanced analytics
  • Seamless integration with the third-party tools you already use via a robust partner certification program (1,600+ partners) and a dedicated Spark Accelerator Partner Program
  • Continual innovations from one of the most active open source communities in the world

The Future of Hadoop: A Deeper Look at Spark

The Cloudera Difference for Apache Spark

As the first platform vendor to ship and support Apache Spark, and with more committers and contributors on staff than any competitor, only Cloudera offers:

  • The most experience in supporting production deployments across all industries and the broadest range of use cases (200+ customers, 300+ contributed patches, 43,000+ lines of code)
  • The deepest integration between Spark and the platform with unified resource management (via YARN), simple administration (via Cloudera Manager), and compliance-ready security and governance (via Apache Sentry and Cloudera Navigator), all of which are critical for running in production
  • The ability to impact the roadmap to meet customer requirements
  • Comprehensive Spark Training for developers and data scientists

Customer Successes

More customers run Apache Spark on Cloudera than any other platform, including: