A distributed framework for processing and generating large data sets, and an implementation that runs on large clusters of industry-standard machines.
The processing model defines two types of functions: a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges
all intermediate values associated with the same intermediate key.
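For illustration, the canonical word-count computation can be written as two such functions. The sketch below is plain Java; the Emitter callback and the function signatures are hypothetical stand-ins for whatever pair-collection mechanism a given framework exposes, not the API of any particular implementation.

    import java.util.Iterator;

    // Hypothetical callback through which map and reduce emit key-value pairs.
    interface Emitter<K, V> {
        void emit(K key, V value);
    }

    class WordCountFunctions {
        // map: for one input pair (byte offset, line of text), emit the
        // intermediate pair (word, 1) for every word on the line.
        static void map(Long offset, String line, Emitter<String, Integer> out) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    out.emit(word, 1);
                }
            }
        }

        // reduce: merge all intermediate values for one word by summing them.
        static void reduce(String word, Iterator<Integer> counts,
                           Emitter<String, Integer> out) {
            int sum = 0;
            while (counts.hasNext()) {
                sum += counts.next();
            }
            out.emit(word, sum);
        }
    }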
A MapReduce job partitions the input data set into independent chunks that the map functions process in parallel. The framework sorts the outputs of the maps, which are
then input to the reduce functions. Typically, both the input and the output of the job are stored in a distributed filesystem.
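A toy, single-process driver makes this data flow concrete: split the input into chunks, apply map to each, sort and group the intermediate pairs by key, then apply reduce to each group. This is only a sketch of the phases under a word-count workload; a real framework runs the map calls on many machines and moves the intermediate pairs over the network.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class ToyJob {
        public static void main(String[] args) {
            // Input "chunks": each element stands in for an independent split.
            List<String> chunks = List.of("to be or not to be", "to see or not to see");

            // Map phase: emit an intermediate (word, 1) pair per word.
            List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
            for (String chunk : chunks) {
                for (String word : chunk.split("\\s+")) {
                    intermediate.add(Map.entry(word, 1));
                }
            }

            // Sort/group phase: the framework sorts map outputs so that all
            // values for the same key arrive together at one reduce call.
            Map<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> pair : intermediate) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }

            // Reduce phase: merge all values associated with the same key.
            for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
                int sum = group.getValue().stream().mapToInt(Integer::intValue).sum();
                System.out.println(group.getKey() + "\t" + sum);
            }
        }
    }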
The implementation provides an API for configuring and submitting jobs, as well as job scheduling and management services; a library of search, sort, index, inverted index, and word
co-occurrence algorithms; and the runtime. The runtime system partitions the input data, schedules the program's execution across a set of machines, handles machine failures, and manages the required inter-machine communication.
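As one concrete example of what such an API can look like (an assumption here, since this section does not name a specific implementation), Apache Hadoop expresses the same word count as user-supplied Mapper and Reducer classes plus a driver that configures and submits the job; the runtime then takes care of partitioning, scheduling, and failure recovery.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {

        // Map: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum all counts associated with the same word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            // Configure the job: which map and reduce classes to run and the
            // output key/value types, then point it at input and output paths
            // in the distributed filesystem.
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submit and wait: scheduling, retries on machine failure, and
            // inter-machine communication are handled by the runtime.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, a driver like this is typically launched with the hadoop jar command, passing input and output paths in the distributed filesystem as arguments.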