This is the documentation for Cloudera 5.2.x.
Documentation for other versions is available at Cloudera Documentation.

Installing DataFu

DataFu is a collection of Apache Pig UDFs (user-defined functions) for statistical evaluation. It was developed by LinkedIn and has been released as open source under the Apache 2.0 license.

To use DataFu:

  1. Install the DataFu package:

    Operating system        Install command
    Red-Hat-compatible      sudo yum install pig-udf-datafu
    SLES                    sudo zypper install pig-udf-datafu
    Debian or Ubuntu        sudo apt-get install pig-udf-datafu

    This puts the DataFu JAR file (for example, datafu-0.0.4-cdh5.0.0.jar) in /usr/lib/pig.

  2. Register the JAR. Replace the <DataFu_version> and <CDH_version> strings with the DataFu and CDH version numbers of the JAR that was installed.
    REGISTER /usr/lib/pig/datafu-<DataFu_version>-cdh<CDH_version>.jar

    For example:

    REGISTER /usr/lib/pig/datafu-0.0.4-cdh5.0.0.jar
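If you are unsure which version numbers to substitute, you can derive the REGISTER line from the JAR that is actually installed. The following is a minimal sketch, not part of the DataFu package: the function name datafu_register_line is hypothetical, and it assumes the JAR name begins with "datafu-" and that the JAR is in /usr/lib/pig unless you pass another directory.

```shell
# Hypothetical helper: print a Pig REGISTER statement for the DataFu JAR
# found in a directory (default: /usr/lib/pig, the path given in step 1).
datafu_register_line() {
    dir="${1:-/usr/lib/pig}"
    # Use the first matching JAR. If nothing matches, the glob stays
    # literal and the -f test below rejects it.
    for jar in "$dir"/datafu-*.jar; do
        if [ -f "$jar" ]; then
            printf 'REGISTER %s\n' "$jar"
            return 0
        fi
    done
    echo "No DataFu JAR found in $dir" >&2
    return 1
}
```

Calling datafu_register_line with no argument checks /usr/lib/pig and prints a line that you can paste at the top of a Pig script. If several DataFu JARs are present, only the first match is reported, so confirm that it is the one you intend to use.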

A number of usage examples and other information are available at https://github.com/linkedin/datafu.