HDFS & MapReduce
Two primary components sit at the core of Apache Hadoop: the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework. These open-source projects, inspired by technologies created inside Google and developed by Cloudera Chief Architect Doug Cutting, form the foundation of the Apache Hadoop ecosystem.
The Hadoop Distributed File System (HDFS)
HDFS is a fault-tolerant, self-healing distributed file system designed to turn a cluster of industry-standard servers into a massively scalable pool of storage. Developed specifically for large-scale data processing workloads where scalability, flexibility, and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100 PB and beyond.
Key HDFS Features:
- Scale-Out Architecture - Add servers to increase capacity
- High Availability - Serve mission-critical workflows and applications
- Fault Tolerance - Automatically and seamlessly recover from failures
- Flexible Access - Multiple open frameworks for serialization and file system mounts
- Load Balancing - Place data intelligently for maximum efficiency and utilization
- Tunable Replication - Multiple copies of each file provide data protection and computational performance
- Security - POSIX-based file permissions for users and groups with optional LDAP integration
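The replication and fault tolerance features listed above can be pictured with a toy model. The sketch below is an illustration only, not HDFS's actual placement policy: the function names `place_blocks` and `re_replicate`, the node names, and the random placement are all assumptions made for this example. It stores each block on several distinct nodes, then restores the replica count after one node fails.

```python
import random

def place_blocks(num_blocks, nodes, replication=3, seed=0):
    """Toy placement: put each block on `replication` distinct nodes."""
    rng = random.Random(seed)
    return {b: set(rng.sample(nodes, replication)) for b in range(num_blocks)}

def re_replicate(placement, failed, nodes, replication=3, seed=1):
    """Drop a failed node, then copy under-replicated blocks to survivors."""
    rng = random.Random(seed)
    alive = [n for n in nodes if n != failed]
    healed = {}
    for block, holders in placement.items():
        holders = holders - {failed}              # replicas that survived
        candidates = [n for n in alive if n not in holders]
        missing = replication - len(holders)      # 0 if the block was unaffected
        holders |= set(rng.sample(candidates, missing))
        healed[block] = holders
    return healed

# Hypothetical five-node cluster holding eight blocks.
nodes = [f"node{i}" for i in range(1, 6)]
placement = place_blocks(8, nodes)
healed = re_replicate(placement, "node2", nodes)
```

In this model, raising `replication` buys more protection and more places to run computation near the data, at the cost of raw storage.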
MapReduce
MapReduce is a massively scalable, parallel processing framework that works in tandem with HDFS. With MapReduce and Hadoop, compute is executed at the location of the data, rather than moving data to the compute location; data storage and computation coexist on the same physical nodes in the cluster. By taking advantage of this data proximity, MapReduce processes exceedingly large amounts of data without being constrained by traditional bottlenecks such as network bandwidth.
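The idea of running compute where the data lives can be sketched as a simple scheduling rule. The example below is a minimal illustration under stated assumptions, not Hadoop's real scheduler: `assign_tasks`, the block IDs, the node names, and the slot counts are all hypothetical. Each task goes to a node that already stores its block when that node has a free slot, and falls back to a remote node otherwise.

```python
def assign_tasks(block_locations, free_slots):
    """Greedy assignment that prefers a node already holding the block."""
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder has capacity: the block must cross the network.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                raise RuntimeError("no free slots left")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment

# Hypothetical cluster state: which nodes hold each block, and free task slots.
block_locations = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node1", "node3"],
    "blk_4": ["node1", "node2"],
}
free_slots = {"node1": 1, "node2": 1, "node3": 2}
assignment = assign_tasks(block_locations, free_slots)
```

With these inputs, three of the four tasks run on a node that holds their block; only `blk_4` has to read its data over the network.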
Key MapReduce Features:
- Scale-Out Architecture - Add servers to increase processing power
- Security & Authentication - Works with HDFS and HBase security to make sure that only approved users can operate against the data in the system
- Resource Manager - Employs data locality and server resources to determine optimal computing operations
- Optimized Scheduling - Completes jobs according to prioritization
- Flexibility - Procedures can be written in virtually any programming language
- Resiliency & High Availability - Multiple job and task trackers ensure that jobs fail independently and restart automatically
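As an illustration of the flexibility point above, here is a word count written in the style used by Hadoop Streaming, where the mapper and reducer exchange tab-separated key/value lines and the framework sorts by key between the two phases. This is a local sketch only: `mapper`, `reducer`, and `run_local` are names chosen for this example, and the sort inside `run_local` stands in for the cluster's shuffle step.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts for each word; input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_local(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))

result = run_local(["the quick brown fox", "the lazy dog", "The end"])
```

Because the two phases only agree on a line format, the same logic could be written in any language that reads and writes text streams.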