Hadoop Overview
Apache Hadoop is a scalable, fault-tolerant system for data storage and processing. Hadoop is economical and reliable, which makes it perfect to run data-intensive applications on commodity hardware.
Hadoop excels at doing complex analyses, including detailed, special-purpose computation, across large collections of data. Hadoop handles search, log processing, recommendation systems, data warehousing and video/image analysis. Unlike traditional databases, Hadoop scales to address the needs of data-intensive distributed applications in a reliable, cost-effective manner.
HDFS and MapReduce
Hadoop creates clusters of machines and coordinates work among them. Clusters can be built and scaled out with inexpensive computers.
The Hadoop software package includes the robust, reliable Hadoop Distributed File System (HDFS), which splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.
Fault-tolerant Hadoop Distributed File System (HDFS)
Provides reliable, scalable, low-cost storage.

HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
In addition, Hadoop includes MapReduce, a parallel distributed processing system that is different from most similar systems on the market. It was designed for clusters of commodity, shared-nothing hardware. No special programming techniques are required to run analyses in parallel using MapReduce; most existing algorithms work without changes. MapReduce takes advantage of the distribution and replication of data in HDFS to spread execution of any job across many nodes in a cluster.
MapReduce Software Framework
Offers clean abstraction between data analysis tasks and the underlying systems challenges involved in ensuring reliable large-scale computation.

- Processes large jobs in parallel across many nodes and combines results.
- Eliminates the bottlenecks imposed by monolithic storage systems.
- Results are collated and digested into a single output after each piece has been analyzed.
If a machine fails, Hadoop continues to operate the cluster by shifting work to the remaining machines. It automatically creates an additional copy of the data from one of the replicas it manages. As a result, clusters are self-healing for both storage and computation without requiring intervention by systems administrators.
