What Tez Does
Apache Tez provides a developer API and framework to write native YARN applications that bridge the spectrum of interactive and batch workloads. It allows those data access applications to work with petabytes of data over thousands nodes. The Apache Tez component library allows developers to create Hadoop applications that integrate natively with Apache Hadoop YARN and perform well within mixed workload clusters. Since Tez is extensible and embeddable, it provides the fit-to-purpose freedom to express highly optimized data processing applications, giving them an advantage over end-user-facing engines such as MapReduce and Apache Spark. Tez also offers a customizable execution architecture that allows users to express complex computations as dataflow graphs, permitting dynamic performance optimizations based on real information about the data and the resources required to process it.
How Tez Works
Apache Tez’ improvement of data processing in Hadoop extend well beyond gains seen in Apache Hive and Apache Pig. The project has set the standard for true integration with YARN for interactive workloads. Read the following short descriptions about how Apache Tez completes core tasks.
Express, model and execute processing logic
Tez models data processing as a dataflow graph, with the graph vertices representing application logic and its edges representing movement of data. A rich data flow definition API allows users to intuitively express complex query logic. The API fits well with query plans produced by higher-level declarative applications like Apache Hive and Apache Pig.
Model interaction between Input, Processor and Output Modules
Tez models the user logic running in each vertex of the dataflow graph as a composition of Input, Processor and Output modules. Input & Output determine the data format and how and where it is read or written. The Processor holds the data transformation logic. Tez does not impose any data format and only requires that Input, Processor and Output formats are compatible with each other.
Dynamically reconfigure graphs
Distributed data processing is dynamic, and it is difficult to determine optimal data movement methods in advance. More information is available during runtime, which may help optimize the execution plan further. So Tez includes support for pluggable vertex management modules to collect runtime information and change the dataflow graph dynamically to optimize performance and resource utilization.
Optimize performance and resource management
YARN manages resources in a Hadoop cluster, based on cluster capacity and load. The Tez execution engine framework efficiently acquires resources from YARN and reuses every component in the pipeline such that no operation is duplicated unnecessarily.
API for defining directed acyclic graphs (DAGs)
Tez defines a simple Java API to express a DAG of data processing. The API has three components
- DAG – this defines the overall job. The user creates a DAG object for each data processing job.
- Vertex – this defines the user logic and the resources & environment needed to execute the user logic. The user creates a Vertex object for each step in the job and adds it to the DAG.
- Edge – this defines the connection between producer and consumer vertices. The user creates an Edge object and connects the producer and consumer vertices using it.
Tez follows the traditional Hadoop model of dividing a job into individual tasks, all of which are run as processes via YARN, on the users’ behalf. This model comes with inherent costs for process startup and initialization, handling stragglers and allocating each container via the YARN resource manager.