Installing DataFu

DataFu is a collection of Apache Pig UDFs (User-Defined Functions) for statistical evaluation. The UDFs were developed by LinkedIn and are now open source under the Apache 2.0 license.

Usage examples and other information are available at https://github.com/linkedin/datafu.

To Use DataFu in a Parcel-deployed Cluster

If your cluster uses parcels, DataFu is already installed. Before using it, register the JAR file with the following command:

REGISTER /opt/cloudera/parcels/CDH/lib/pig/datafu.jar

To Use DataFu in a Package-deployed Cluster

  1. Install the DataFu package:

    Operating system      Install command
    Red-Hat-compatible    sudo yum install pig-udf-datafu
    SLES                  sudo zypper install pig-udf-datafu
    Debian or Ubuntu      sudo apt-get install pig-udf-datafu

    This puts the DataFu JAR file (for example, datafu-0.0.4-cdh5.0.0.jar) in /usr/lib/pig.

  2. Register the JAR. Replace <DataFu_version> and <CDH_version> with the DataFu and CDH version numbers of the installed JAR file.

    REGISTER /usr/lib/pig/datafu-<DataFu_version>-cdh<CDH_version>.jar

    For example:

    REGISTER /usr/lib/pig/datafu-0.0.4-cdh5.0.0.jar
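
Because the version numbers in the JAR name change between releases, it can help to look up the installed file rather than type the name by hand. The following is a minimal shell sketch, assuming the package installed a file matching datafu-*.jar in /usr/lib/pig as described above. The function name datafu_register_line is hypothetical and is not part of DataFu or CDH.

```shell
# Print a Pig REGISTER statement for the first DataFu JAR found in a directory.
# datafu_register_line is an illustrative helper, not a DataFu or CDH command.
# Usage: datafu_register_line [directory]   (directory defaults to /usr/lib/pig)
datafu_register_line() {
    dir="${1:-/usr/lib/pig}"
    for jar in "$dir"/datafu-*.jar; do
        # If nothing matches, the glob is left unexpanded, so check for a real file.
        if [ -f "$jar" ]; then
            echo "REGISTER $jar"
            return 0
        fi
    done
    echo "No DataFu JAR found in $dir" >&2
    return 1
}
```

Running datafu_register_line /usr/lib/pig on a host where the package is installed should print a REGISTER line for the installed JAR, which can then be pasted into a Pig script or the Grunt shell. If more than one matching JAR is present, the sketch reports only the first one in glob order.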