The open standard for flexible, in-memory data processing for batch, real-time, and advanced analytics
What is Apache Spark?
Apache Spark is an open source, general data processing framework that complements Apache Hadoop to make it easy to develop fast, end-to-end Big Data applications combining batch, streaming, and interactive analytics on all your data. Apache Spark is a key component inside CDH, Cloudera’s open source platform, with full enterprise support and capabilities available via Cloudera Enterprise.
Spark in Hadoop
Cloudera is committed to adopting Apache Spark as the replacement for MapReduce as the core data execution engine for workloads in the Hadoop ecosystem. To help users make this transition, Cloudera's Apache committers are working to complement MapReduce with Spark in Hadoop ecosystem components, including Apache Crunch on Spark, Apache Solr on Spark, Apache HBase-Spark integration (Cloudera Labs), Hive on Spark (beta), and Apache Pig on Spark (alpha).
Apache Spark Benefits
Turn data into actionable insights, and quickly iterate for maximum impact
For Developers and Data Scientists:
- Easy development in Scala, Python, or Java via a rich set of operators and machine learning libraries
- Increase productivity through a single, expressive API for batch and streaming applications
- Interactive development with significant performance improvements over MapReduce
- Reduce complexity and costs with the standard engine for batch, streaming, and advanced analytics
- Seamless integration with the third-party tools you already use via a robust partner certification program (1,600+ partners) and a dedicated Spark Accelerator Partner Program
- Continual innovations from one of the most active open source communities in the world
The Cloudera Difference for Apache Spark
As the first platform vendor to ship and support Apache Spark, and with more committers and contributors on staff than any competitor, only Cloudera offers:
- The most experience in supporting production deployments across all industries and the broadest range of use cases (200+ customers, 300+ contributed patches, 43,000+ lines of code)
- The deepest integration between Spark and the platform with unified resource management (via YARN), simple administration (via Cloudera Manager), and compliance-ready security and governance (via Apache Sentry and Cloudera Navigator), all of which are critical for running in production
- The ability to impact the roadmap to meet customer requirements
- Comprehensive Spark Training for developers and data scientists
More customers run Apache Spark on Cloudera than on any other platform, including: