Managing Apache Hive User-Defined Functions

You can extend Hive SQL using Java-based user-defined functions (UDFs) and call the UDF from a Hive query.

Prerequisite: In a Hadoop- and Hive-compatible Java project, you write and compile user-defined function code, and then export it to a JAR.
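
For illustration, a minimal sketch of such a class might look like the following, assuming the hypothetical com.mycompany.hiveudf.TypeOf01 names used in the registration examples below. It uses the simple org.apache.hadoop.hive.ql.exec.UDF base class, which requires one evaluate() overload per supported Hive type:

  package com.mycompany.hiveudf;

  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;

  // Hypothetical UDF that returns the name of the Hive type of its argument.
  // Hive resolves each call to the evaluate() overload that matches the
  // argument's type.
  public class TypeOf01 extends UDF {
    public Text evaluate(Text input) {
      return new Text("string");
    }

    public Text evaluate(IntWritable input) {
      return new Text("int");
    }
  }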

In the Registering a UDF procedure below, you store the JAR on your cluster and register the UDF using Hive commands. The configuration and registration methods differ in ease of use, suitability for frequently modified UDFs, and Sentry support. For example, if you use the Hive aux JARs directory method, you need to restart HiveServer2 after registration; if you use another method, you do not.

Registering a UDF in Hive

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

You configure the cluster in one of several ways to find the JAR containing your UDF code, and then you register the UDF in Hive.

  1. Assuming you just built your Java project in IntelliJ, navigate to the JAR in the /target directory of the project.
  2. Choose one of the following methods for configuring the cluster to find the JAR, and then follow the respective step-by-step procedure in sections below:
    • Direct JAR reference configuration

      Straightforward, but recommended for development only. Does not support Sentry.

    • Hive aux JARs directory configuration

      Recommended for tested, stable UDFs because it prevents accidental overwriting of files or functions. Does not support Sentry.

    • Reloadable aux JAR configuration

      Avoids HiveServer restarts. Recommended if you anticipate making frequent changes to the UDF logic. Supports Sentry.

      If you connect to HiveServer2 through a load balancer, issuing the RELOAD command loads the JAR file only on the HiveServer2 instance you are connected to. Consequently, if you have multiple HiveServer2 instances behind a load balancer, you must install the JAR file on each node and connect to each HiveServer2 instance to issue the RELOAD command.

  3. After configuring the cluster to find the JAR, use Beeline to start Hive.
    • On the command line of a node that runs the HiveServer2 role, type beeline.
    • Use the FQDN of the HiveServer2 host in your cluster to replace myhiveserver.com, and enter the database user name and database password, or use the default hive user. For example: beeline -u jdbc:hive2://myhiveserver.com:10000 -n hive -p
  4. Run one of the following CREATE FUNCTION commands that corresponds to your configuration:
    • Direct JAR reference
      hive> CREATE FUNCTION <function_name> AS '<fully_qualified_class_name>' USING JAR 'hdfs:///<path/to/jar/in/hdfs>';
      Where <fully_qualified_class_name> is the fully qualified name of the Java class in your JAR file, including the package. For example,
      hive> CREATE FUNCTION udftypeof AS 'com.mycompany.hiveudf.TypeOf01' USING JAR 'hdfs:///user/max/udf/hiveudf-1.0-SNAPSHOT.jar';
    • Hive Aux JARs directory
      hive> CREATE FUNCTION <function_name> AS '<fully_qualified_class_name>';

      Restart HiveServer2.

    • Reloadable Aux JAR

      If you use the Reloadable Aux JAR method, the RELOAD command makes HiveServer2 pick up JARs that you have added to (or removed from) the reloadable aux JARs directory, without a restart.

      hive> RELOAD; 
      hive> CREATE FUNCTION <function_name> AS '<fully_qualified_class_name>';
      For example,
      hive> RELOAD; 
      hive> CREATE FUNCTION udftypeof AS 'com.mycompany.hiveudf.Typeof01';

    The UDF is registered.
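
After registration, you can call the UDF in a query like any built-in function. For example, assuming the hypothetical udftypeof function registered above and an employees table with a string column named name:

hive> SELECT udftypeof(name) FROM employees LIMIT 1;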

Direct JAR Reference Configuration

You can configure the cluster to find the JAR using the ADD JAR command on the Hive command line.

  1. Upload the JAR from your Java project to your CDH cluster. For example, navigate to the JAR in the /target directory of an IntelliJ project, and upload a JAR named TypeOf-1.0-SNAPSHOT.jar.
  2. Move the JAR to HDFS.
    $ sudo su - hdfs
    $ hdfs dfs -put TypeOf-1.0-SNAPSHOT.jar /user/max/udf/hiveudf-1.0-SNAPSHOT.jar
  3. Start Hive from Beeline, and enter the following command:
    hive> ADD JAR /user/max/udf/hiveudf-1.0-SNAPSHOT.jar;
    The JAR named hiveudf-1.0-SNAPSHOT.jar is added to the Hive classpath.
  4. After configuring the cluster, register the UDF as described above.
This configuration method does not work when Beeline runs on a different host from HiveServer2. In that case, use one of the other configuration methods described below.

Hive Aux JARs Directory Configuration

You can configure the cluster to find a UDF JAR by setting hive.aux.jars.path to a directory (not a classpath) where you uploaded your JAR. Do not use the /opt/cloudera/parcels/CDH directory because that directory changes with every release. You must create and manage the directory on every host that runs the Hive Metastore, HiveServer2, or the Hive CLI. The directory location is set in the environment as HIVE_AUX_JARS_PATH, which generally overrides the hive.aux.jars.path property even when hive.aux.jars.path is set in an advanced configuration snippet.

After navigating to the JAR in the /target directory of an IntelliJ project, perform these steps:
  1. Upload the JAR file to the host or hosts that run HiveServer2 (see the example after these steps).
  2. Give the hive user read, write, and execute access to the directory where the JAR resides, for example /opt/local/hive/lib/.
  3. If the Hive Metastore runs on a different host or hosts, create the same directory that you created on the HiveServer2 host on every Hive Metastore host. For example, create /opt/local/hive/lib/ on the Hive Metastore host. You do not need to copy the JAR file to the Hive Metastore host directory; Hive takes care of that when you register the UDF. If the directory is not present on a Hive Metastore host, the Hive Metastore service does not start.
  4. In the Cloudera Manager Admin Console, go to the Hive service > Configuration tab.
  5. Expand the Hive (Service-Wide) scope.
  6. In Advanced, configure the Hive Auxiliary JARs Directory property with the path to the JAR, for example /opt/local/hive/lib/.
  7. Save changes.
  8. In Cloudera Manager Admin Console > Hive service > Actions, redeploy the Hive client configuration.
  9. Restart the Hive service.
  10. After configuring the cluster, register the UDF as described above.
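
For example, steps 1 and 2 might look like the following, assuming the hypothetical HiveServer2 host myhiveserver.com and the /opt/local/hive/lib/ directory used above:

$ scp target/hiveudf-1.0-SNAPSHOT.jar root@myhiveserver.com:/opt/local/hive/lib/
$ ssh root@myhiveserver.com 'chown -R hive:hive /opt/local/hive/lib && chmod -R u+rwx /opt/local/hive/lib'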

Reloadable Aux JAR Configuration

You can configure the cluster to find a UDF JAR by adding the hive.reloadable.aux.jars.path property using the Hive Service Advanced Configuration Snippet (Safety Valve). For example, after navigating to the JAR in the /target directory of an IntelliJ project, perform these steps:

  1. Upload the JAR file to a directory on a host that runs HiveServer2.
  2. Give the hive user read, write, and execute access to the directory.
  3. If the Hive Metastore runs on a different host or hosts, create the same directory that you created on the HiveServer2 host on every Hive Metastore host. For example, create /opt/local/hive/lib/ on the Hive Metastore host. You do not need to copy the JAR file to the Hive Metastore host directory; Hive takes care of that when you register the UDF. If the directory is not present on a Hive Metastore host, the Hive Metastore service does not start.
  4. In Cloudera Manager Admin Console > Hive service > Configuration > Filters > Advanced, click Hive (Service-Wide) scope.
  5. In Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml, add the following property (see the equivalent hive-site.xml entry after these steps):
    • name = hive.reloadable.aux.jars.path
    • value = the path to the JAR file
  6. Save changes.
  7. In Cloudera Manager Admin Console > Hive service > Actions, redeploy the Hive client configuration.
  8. Restart the Hive service.

    This step is necessary only initially. Subsequently, you can add or remove JARs using RELOAD.

  9. If you use Sentry, on the Hive command line grant privileges on the JAR files to the roles that require access.
    GRANT ALL ON URI 'file:///opt/local/hive/lib/<my.jar>' TO ROLE <example_role>;
  10. After configuring the cluster, register the UDF as described above.
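
The property you set through the safety valve in step 5 corresponds to the following hive-site.xml entry, assuming the hypothetical /opt/local/hive/lib/ directory used in these examples:

<property>
  <name>hive.reloadable.aux.jars.path</name>
  <value>/opt/local/hive/lib</value>
</property>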

Creating Temporary Functions

Use the TEMPORARY keyword in the CREATE FUNCTION command to register a temporary function instead of a permanent one. A temporary function is visible only in the session that creates it and disappears when the session ends. For example:
hive> CREATE TEMPORARY FUNCTION <your_function_name> AS '<fully_qualified_class_name>';

hive> CREATE TEMPORARY FUNCTION <function_name> AS
  '<fully_qualified_class_name>' USING JAR
  'hdfs:///<path/to/jar/in/hdfs>'; 
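
For example, using the hypothetical names from the registration examples above:

hive> CREATE TEMPORARY FUNCTION udftypeof AS 'com.mycompany.hiveudf.TypeOf01' USING JAR 'hdfs:///user/max/udf/hiveudf-1.0-SNAPSHOT.jar';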

Updating a User-Defined Function

When you change the UDF, you need to re-register it. If you use the ADD JAR method described above, simply drop the function, add the new one, and restart HiveServer2; otherwise, follow these steps to update a UDF:

  1. Update your Java code in your Java project.
  2. On the Hive command line, drop the UDF that has been updated.
    hive> DROP FUNCTION <function_name>;
  3. Delete the old JAR file from the cluster.
  4. Upload the updated JAR to the cluster to the location that corresponds to either your Hive Aux Jars Directory or Reloadable Aux Jars configuration.
  5. If Sentry is enabled on your cluster, grant privileges on the JAR files to the roles that require access.
  6. If you use Reloadable Aux JARs, start Hive from Beeline, and run the RELOAD command.
  7. Run the CREATE FUNCTION command that corresponds to either your Hive Aux Jars Directory or Reloadable Aux Jars configuration as described above.
  8. If you use Hive Aux Jars, redeploy the client configurations and restart HiveServer2.
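
For example, assuming the Reloadable Aux JAR configuration and the hypothetical names used earlier, an update might look like this:

hive> DROP FUNCTION udftypeof;
$ rm /opt/local/hive/lib/hiveudf-1.0-SNAPSHOT.jar
$ cp target/hiveudf-1.0-SNAPSHOT.jar /opt/local/hive/lib/
hive> RELOAD;
hive> CREATE FUNCTION udftypeof AS 'com.mycompany.hiveudf.TypeOf01';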

Calling a Hive UDF from Impala

You can call a UDF that you register in Hive from an Impala query under the following conditions:

  • Sentry is not enabled.
  • When you register the UDF, you use a CREATE FUNCTION statement that includes the USING JAR clause (either the Direct JAR reference or Hive Aux JARs directory method).
  • You meet the other requirements described in the Impala documentation.

The CREATE FUNCTION statement must include the JAR location; otherwise, Impala does not load the function. Impala relies on the location you provide during function creation. The JAR, which contains the UDF code, must reside on HDFS, which makes it automatically available to all Impala nodes; you do not need to manually copy any UDF-related files between servers.
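
For example, assuming the hypothetical udftypeof function was registered in Hive with the USING JAR clause, and an employees table exists, you might call it from impala-shell as follows (INVALIDATE METADATA prompts Impala to pick up the newly created function):

> INVALIDATE METADATA;
> SELECT udftypeof(name) FROM employees LIMIT 1;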

If you cannot register the UDF in Hive (for example, because you use Sentry), register the UDF in Impala instead. Do not give an Impala-registered UDF the same name as any Hive-registered UDF.

Adding Built-in UDFs to the HiveServer2 Blacklist

Built-in UDFs, such as the year function, are available in Hive natively. HiveServer2 maintains a blacklist of built-in UDFs to prevent attacks that use the hive user credentials to execute Java code. You set the hive.server2.builtin.udf.blacklist property to a comma-separated list of built-in UDFs that Hive does not execute. A UDF that is included in the blacklist returns an error if invoked from a query. By default, this property is empty.

To check whether hive.server2.builtin.udf.blacklist contains any UDFs, run the following statement on the Hive command line:

hive> SET hive.server2.builtin.udf.blacklist;

Any blacklisted UDFs are returned.
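
For example, if the reflect and java_method built-ins had been added to the blacklist (a hypothetical value), the output would look similar to this:

hive> SET hive.server2.builtin.udf.blacklist;
hive.server2.builtin.udf.blacklist=reflect,java_method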

To add built-in UDF names to the hive.server2.builtin.udf.blacklist property with Cloudera Manager:

  1. In the Cloudera Manager Admin Console, go to the Hive service.
  2. On the Hive service page, click the Configuration tab.
  3. On the Configuration page, click HiveServer2 under Scope and click Advanced under Category.
  4. Search for HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml and add the following information:
    • Name: hive.server2.builtin.udf.blacklist
    • Value: <builtin_udf_name1>,<builtin_udf_name2>...
    • Description: Blacklisted built-in UDFs.
  5. Click Save Changes and restart the HiveServer2 service for the changes to take effect.

If you are not using Cloudera Manager to manage your cluster, set the hive.server2.builtin.udf.blacklist property in the hive-site.xml file.
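
For example, the following hive-site.xml entry blacklists the reflect and java_method built-ins, which can execute arbitrary Java methods; the particular UDFs listed here are an illustrative assumption:

<property>
  <name>hive.server2.builtin.udf.blacklist</name>
  <value>reflect,java_method</value>
</property>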