Hadoop Distributed File System (HDFS)
A user-space filesystem designed for storing very large files with streaming data access patterns, running on clusters of industry-standard machines. HDFS defines three components:
- NameNode - Maintains the HDFS namespace tree and a mapping of file blocks to the DataNodes where the data is stored. A simple HDFS cluster has only one primary NameNode, supported by a secondary NameNode that periodically checkpoints the metadata by merging the NameNode edits log, which lists HDFS metadata modifications, into the namespace image. This keeps the log file small, which reduces the disk space it consumes on the NameNode and shortens the restart time for the primary NameNode. A high availability cluster contains two NameNodes: active and standby.
- DataNode - Stores the file data in a Hadoop cluster; DataNode is also the name of the daemon that manages that data. File data is replicated across multiple DataNodes for reliability and so that computation can run locally, near the data.
- JournalNode - Maintains a directory that logs modifications to the namespace metadata when the Quorum-based Storage mechanism is used to provide high availability. During failover, the standby NameNode ensures that it has applied all of the edits from the JournalNodes before promoting itself to the active state.
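
To illustrate why the secondary NameNode's periodic compaction of the edits log shortens restart time, here is a minimal sketch in Python. It is a toy in-memory model, not HDFS code: `ToyNameNode`, `replay`, the operation names, and the paths and block IDs are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Toy model of NameNode metadata: a checkpointed image plus an edits log."""
    image: dict = field(default_factory=dict)  # last checkpoint: path -> block IDs
    edits: list = field(default_factory=list)  # modifications since that checkpoint

    def log_edit(self, op, path, blocks=None):
        # Every namespace modification is appended to the edits log.
        self.edits.append((op, path, list(blocks or [])))

    def checkpoint(self):
        # Role played by the secondary NameNode in this sketch: fold the
        # logged edits into the image, then start a fresh, empty log.
        self.image = replay(self.image, self.edits)
        self.edits = []

    def restart(self):
        # A restart rebuilds the namespace from the image plus the remaining
        # edits; the second value is how many edits had to be replayed.
        return replay(self.image, self.edits), len(self.edits)


def replay(image, edits):
    """Apply logged edits, in order, to a copy of the image."""
    namespace = dict(image)
    for op, path, blocks in edits:
        if op == "create":
            namespace[path] = blocks
        elif op == "delete":
            namespace.pop(path, None)
    return namespace


# Hypothetical paths and block IDs, for illustration only.
nn = ToyNameNode()
nn.log_edit("create", "/logs/a", ["blk_1", "blk_2"])
nn.log_edit("create", "/logs/b", ["blk_3"])
nn.log_edit("delete", "/logs/a")

before, replayed_before = nn.restart()  # replays all three edits
nn.checkpoint()
after, replayed_after = nn.restart()    # nothing left to replay
```

In this sketch the namespace is identical before and after the checkpoint; only the amount of log that must be replayed on restart changes, which is the effect the NameNode entry describes.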