Apache Spark

The open standard for flexible, in-memory data processing for batch, real-time, and advanced analytics

What is Apache Spark?

Apache Spark is an open source, general-purpose data processing framework in the Apache Hadoop ecosystem that makes it easy to develop fast, end-to-end Big Data applications combining batch, streaming, and interactive analytics on all your data. Apache Spark is a key component of CDH, Cloudera’s open source platform, with full enterprise support and capabilities available via Cloudera Enterprise.

Apache Spark in Hadoop

Cloudera is committed to adopting Apache Spark as the core data execution engine for the Hadoop ecosystem, replacing MapReduce for most workloads. To help users make this transition, Cloudera's Apache committers are working to complement MapReduce with Spark across Hadoop ecosystem components, including Apache Crunch on Spark, Apache Solr on Spark, Apache HBase-Spark integration (Cloudera Labs), Hive on Spark (beta), and Apache Pig on Spark (alpha).

Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem

Apache Spark Benefits

Turn data into actionable insights, and quickly iterate for maximum impact

For Developers and Data Scientists:

  • Easy development in Scala, Python, or Java via a rich set of operators and machine learning libraries
  • Increased productivity through a single, expressive API for batch and streaming applications
  • Interactive development with significant performance improvements over MapReduce

  • Why Apache Spark is a Delight for Developers
  • Why Apache Spark is a Crossover Hit for Data Scientists
  • More How-tos

For Architects:

  • Reduced complexity and costs with a standard engine for batch, streaming, and advanced analytics
  • Seamless integration with the third-party tools you already use via a robust partner certification program (1,600+ partners) and dedicated Spark Accelerator Partner Program
  • Continual innovations from one of the most active open source communities in the world

The Future of Hadoop: A Deeper Look at Spark

The Cloudera Difference for Apache Spark

As the first platform vendor to ship and support Apache Spark and with more committers and contributors on staff than any competitor, only Cloudera offers:

  • The most experience in supporting production deployments across all industries and the broadest range of use cases (200+ customers, 300+ contributed patches, 43,000+ lines of code)
  • The deepest integration between Spark and the platform with unified resource management (via YARN), simple administration (via Cloudera Manager), and compliance-ready security and governance (via Apache Sentry and Cloudera Navigator) - all critical for running in production
  • The ability to impact the roadmap to meet customer requirements
  • Comprehensive Spark Training for developers and data scientists

Customer Successes

More customers run Apache Spark on Cloudera than on any other platform.

Future of Apache Spark

Uniting Spark and Hadoop through the One Platform Initiative

Apache Spark is well positioned to replace MapReduce, but for customers to fully embrace it within Apache Hadoop, there is still work to be done to make it enterprise-grade. The One Platform Initiative is the driving force behind the community goal of making Spark the standard data processing engine for Hadoop. To achieve this vision of Spark replacing MapReduce, Cloudera, together with the community, will specifically address the following key areas:

Leverage Hadoop-native resource management

  • Initial Spark-on-YARN integration for shared resource management
  • Metrics for easy diagnostics
  • Improved Spark-on-YARN for better multi-tenancy, performance, and ease of use
  • Automated configurations that optimize over time
  • Visibility into resource utilization
  • Improved PySpark integration for Python access

Full support for Hadoop security and beyond

  • Authentication through Kerberos integration
  • Authorization through fine-grained access controls
  • Governance through audit and lineage
  • Integration with Intel’s Advanced Encryption libraries
  • Full Spark PCI compliance

Enable 10,000-node clusters

  • Improved integration with HDFS to enable scheduling based on data locality and cached data
  • Reduced memory pressure on large jobs
  • Dynamic resource utilization and prioritization
  • Stress testing at scale with mixed multi-tenant workloads

Support for 80% of common stream processing workloads

  • Spark Streaming resiliency for zero data loss
  • Data ingest integration with Apache Kafka and Apache Flume
  • Improved state management for better performance
  • Higher-level language extensions that open real-time workloads to a wider audience
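The Spark-on-YARN integration in the first area above is already usable today. A cluster submission looks something like the following sketch; the application class, jar name, queue, and resource sizes are all illustrative, not prescribed values.

```shell
# Submit a Spark application to a YARN cluster for shared resource management.
# All names and sizes below are illustrative.
spark-submit \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  --queue analytics \
  --class com.example.WordCount \
  wordcount-assembly.jar hdfs:///data/input
```

Because YARN owns scheduling, the same cluster can run Spark alongside MapReduce and other frameworks, with per-queue resource limits enforced centrally rather than per-engine.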

Cloudera Developer Blog »