Managing Apache Hive User-Defined Functions (UDFs) in CDH

Hive's query language (HiveQL) can be extended with Java-based user-defined functions (UDFs). See the Apache Hive Language Manual UDF page for information about Hive built-in UDFs. To create customized UDFs, see the Apache Hive wiki. After creating a new Java class to extend the com.example.hive.udf package, you must compile your code into a Java archive file (JAR), and add it to the Hive classpath with the ADD JAR command. The ADD JAR command does not work with HiveServer2 and the Beeline client when Beeline runs on a different host. As an alternative to ADD JAR, Hive's auxiliary paths functionality should be used.

Perform one of the following procedures depending on whether you want to create permanent or temporary functions.

Blacklist for Built-in UDFs

HiveServer2 maintains a blacklist for built-in UDFs to secure itself against attacks in a multiuser scenario where the hive user's credentials can be used to execute any Java code.
hive.server2.builtin.udf.blacklist A comma separated list of built-in UDFs that are not allowed to be executed. A UDF that is included in the list will return an error if invoked from a query.

Default value: Empty

User-Defined Functions (UDFs) with HiveServer2 Using Cloudera Manager

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

Creating Permanent Functions

  1. Copy the JAR file to HDFS and make sure the hive user can access this JAR file.
  2. Copy the JAR file to the host on which HiveServer2 is running. Save the JARs to any directory you choose, give the hive user read, write, and execute access to this directory, and make a note of the path (for example, /opt/local/hive/lib/).
  3. In the Cloudera Manager Admin Console, go to the Hive service.
  4. Click the Configuration tab.
  5. Expand the Hive (Service-Wide) scope.
  6. Click the Advanced category.
  7. Configure the Hive Auxiliary JARs Directory property with the HiveServer2 host path and the Hive Metastore host path from Step 2, for example /opt/local/hive/lib/. Setting this property overwrites hive.aux.jars.path, even if this variable has been previously set in the HiveServer2 advanced configuration snippet.
  8. Click Save Changes. The JARs are added to HIVE_AUX_JARS_PATH environment variable.
  9. Redeploy the Hive client configuration.
    1. In the Cloudera Manager Admin Console, go to the Hive service.
    2. From the Actions menu at the top right of the service page, select Deploy Client Configuration.
    3. Click Deploy Client Configuration.
  10. Restart the Hive service.
  11. With Sentry enabled - Grant privileges on the JAR files to the roles that require access. Log in to Beeline as user hive and use the Hive SQL GRANT statement to do so. For example:
    GRANT ALL ON URI 'file:///opt/local/hive/lib/my.jar' TO ROLE
                  EXAMPLE_ROLE
  12. Run the CREATE FUNCTION command to create the UDF from the JAR file.
    With Sentry enabled - On a Sentry secured cluster, the USING JAR command is not supported. To load the jar, you have to make sure the JAR file is at the location pointed to by either hive.aux.jars.path or hive.reloadable.aux.jars.path.
    CREATE FUNCTION addfunc AS 'com.example.hiveserver2.udf.add';
    Without Sentry - Run the CREATE FUNCTION command as follows and point to the JAR file location in HDFS. For example:
    CREATE FUNCTION addfunc AS 'com.example.hiveserver2.udf.add' USING JAR 'hdfs:///path/to/jar'

Creating Temporary Functions

  1. Copy the JAR file to the host on which HiveServer2 is running. Save the JARs to any directory you choose, give the hive user read, write, and execute access to this directory, and make a note of the path (for example, /opt/local/hive/lib/).
  2. In the Cloudera Manager Admin Console, go to the Hive service.
  3. Click the Configuration tab.
  4. Expand the Hive (Service-Wide) scope.
  5. Click the Advanced category.
  6. Configure the Hive Auxiliary JARs Directory property with the HiveServer2 host path and the Hive Metastore host path from Step 1, for example /opt/local/hive/lib/. Setting this property overwrites hive.aux.jars.path, even if this variable has been previously set in the HiveServer2 advanced configuration snippet.
  7. Click Save Changes. The JARs are added to HIVE_AUX_JARS_PATH environment variable.
  8. Redeploy the Hive client configuration.
    1. In the Cloudera Manager Admin Console, go to the Hive service.
    2. From the Actions menu at the top right of the service page, select Deploy Client Configuration.
    3. Click Deploy Client Configuration.
  9. Restart the Hive service.
  10. With Sentry enabled - Grant privileges on the JAR files to the roles that require access. Log in to Beeline as user hive and use the Hive SQL GRANT statement to do so. For example:
    GRANT ALL ON URI 'file:///opt/local/hive/lib/my.jar' TO ROLE EXAMPLE_ROLE
  11. Run the CREATE TEMPORARY FUNCTION command. For example:
    CREATE TEMPORARY FUNCTION addfunc AS 'com.example.hiveserver2.udf.add'

User-Defined Functions (UDFs) with HiveServer2 Using the Command Line

The following sections describe how to create permanent and temporary functions using the command line.

Creating Permanent Functions

  1. Copy the JAR file to HDFS and make sure the hive user can access this jar file.
  2. On the Beeline client machine, in /etc/hive/conf/hive-site.xml, set the hive.aux.jars.path property to a comma-separated list of the fully qualified paths to the JAR file and any dependent libraries.
    hive.aux.jars.path=file:///opt/local/hive/lib/my.jar
  3. Copy the JAR file (and its dependent libraries) to the host running HiveServer2/Impala. Make sure the hive user has read, write, and execute access to these files on the HiveServer2/Impala host.
  4. On the HiveServer2/Impala host, open /etc/default/hive-server2 and set the AUX_CLASSPATH variable to a comma-separated list of the fully qualified paths to the JAR file and any dependent libraries.
    AUX_CLASSPATH=/opt/local/hive/lib/my.jar
  5. Restart HiveServer2.
  6. With Sentry enabled - Grant privileges on the JAR files to the roles that require access. Login to Beeline as user hive and use the Hive SQL GRANT statement to do so. For example:
    GRANT ALL ON URI 'file:///opt/local/hive/lib/my.jar' TO ROLE EXAMPLE_ROLE
    If you are using Sentry policy files, you can grant the URI privilege as follows:
    udf_r = server=server1->uri=file:///opt/local/hive/lib
    udf_r = server=server1->uri=hdfs:///path/to/jar
  7. Run the CREATE FUNCTION command to create the UDF from the JAR file.
    With Sentry enabled - On a Sentry secured cluster, the USING JAR command is not supported. To load the jar, you have to make sure the JAR file is at the location pointed to by either hive.aux.jars.path or hive.reloadable.aux.jars.path.
    CREATE FUNCTION addfunc AS 'com.example.hiveserver2.udf.add';
    Without Sentry - Run the CREATE FUNCTION command as follows and point to the JAR file location in HDFS. For example:
    CREATE FUNCTION addfunc AS 'com.example.hiveserver2.udf.add' USING JAR 'hdfs:///path/to/jar'

Creating Temporary Functions

  1. On the Beeline client machine, in /etc/hive/conf/hive-site.xml, set the hive.aux.jars.path property to a comma-separated list of the fully qualified paths to the JAR file and any dependent libraries.
    hive.aux.jars.path=file:///opt/local/hive/lib/my.jar
  2. Copy the JAR file (and its dependent libraries) to the host running HiveServer2/Impala. Make sure the hive user has read, write, and execute access to these files on the HiveServer2/Impala host.
  3. On the HiveServer2/Impala host, open /etc/default/hive-server2 and set the AUX_CLASSPATH variable to a comma-separated list of the fully qualified paths to the JAR file and any dependent libraries.
    AUX_CLASSPATH=/opt/local/hive/lib/my.jar
  4. If Sentry is enabled - Grant privileges on the local JAR files to the roles that require access. Login to Beeline as user hive and use the Hive SQL GRANT statement to do so. For example:
    GRANT ALL ON URI 'file:///opt/local/hive/lib/my.jar' TO ROLE EXAMPLE_ROLE
    If you are using Sentry policy files, you can grant the URI privilege as follows:
    udf_r = server=server1->uri=file:///opt/local/hive/lib
  5. Restart HiveServer2.
  6. Run the CREATE TEMPORARY FUNCTION command and point to the JAR from Hive:
    CREATE TEMPORARY FUNCTION addfunc AS 'com.example.hiveserver2.udf.add'

Update Existing HiveServer2 User-Defined Functions (UDFs)

Consider a sample UDF, hive_udf.jar that is already in use. If you want to update the sample function, my_udf, in the hive_udf.jar JAR file, you first drop my_udf using the DROP FUNCTION command. Then, delete the old JAR file and place the new JAR with the updated function in both, HDFS and the local filesystem. Finally, re-create the my_udf function defined in the new JAR using the CREATE FUNCTION command. These steps have been described in detail below:

  1. Drop the function to be updated, in this case, my_udf. This can be done by using bash scripts to first, connect to HiveServer2 using Beeline, and then, drop the function. First create an .hql file with the DROP FUNCTION command:
    # more drop_func.hql 
    
    drop function my_udf; 
    Create a bash script to connect to HiveServer2 using Beeline and the previously-created drop_func.hql:
    # more drop_func.sh 
    
    beeline -u "jdbc:hive2://hiveServer2:10000/default;principal=hive/HiveServer2@CLOUDERA.COM" -f drop_func.hql
    Run the bash script:
    # ./drop_func.sh
  2. Delete the old JAR file, hive_udf.jar from both HDFS, and the local filesystem. On the local filesystem, the JAR can be found at the location pointed to by either hive.aux.jars.path or hive.reloadable.aux.jars.path.
  3. Place the updated hive_udf.jar in HDFS and the local filesystem (for example, /user/hive/udf). This time, make sure you add the JAR to the path set by hive.reloadable.aux.jars.path. You do not need to place the JAR file at the path pointed to by hive.aux.jars.path.
    If you are using Cloudera Manager, use an advanced configuration snippet to set the value for hive.reloadable.aux.jars.path.
    1. In the Cloudera Manager Admin Console, go to the Hive service.
    2. Click the Configuration tab.
    3. Expand the Hive (Service-Wide) > Advanced categories.
    4. Locate the HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml property and add:
      <property>
          <name>hive.reloadable.aux.jars.path</name>
          <value>/user/hive/udf</value>
      </property>
    5. Delete the value for the Hive Auxiliary JARs Directory property.
    6. Click Save Changes to commit the changes.
  4. Recreate functions defined in hive-udf.jar. This can also be done by using bash scripts to connect to HiveServer2 and run the CREATE FUNCTION command. First create an .hql file with the CREATE FUNCTION command:
     # more create_func.hql 
    
    CREATE FUNCTION my_udf as ‘com.cloudera.sa.hiveudf.myudf’;
    Create a bash script to connect to HiveServer2 using Beeline and the previously-created create_func.hql:
    # more create_func.sh 
    
    beeline -u "jdbc:hive2://hiveServer2:10000/default;principal=hive/HiveServer2@CLOUDERA.COM" -f create_func.hql
    Then run the bash script:
    # ./create_func.sh
  5. If you already have a Beeline session open before the JAR was updated, run the RELOAD command. The new UDF will take effect once the command completes successfully.