This is the documentation for CDH 5.0.x. Documentation for other versions is available at Cloudera Documentation.

Built-in Functions

Impala supports several categories of built-in functions. These functions let you perform mathematical calculations, string manipulation, date calculations, and other kinds of data transformations directly in SELECT statements. The built-in functions let a SQL query return results with all formatting, calculating, and type conversions applied, rather than performing time-consuming postprocessing in another application. By applying function calls where practical, you can make a SQL query that is as convenient as an expression in a procedural programming language or a formula in a spreadsheet.

The categories of functions supported by Impala are:

Mathematical Functions
Type Conversion Functions
Date and Time Functions
Conditional Functions
String Functions
Aggregation functions, explained in Aggregate Functions.

You call any of these functions through the SELECT statement. For most functions, you can omit the FROM clause and supply literal values for any required arguments:

select abs(-1);
select concat('The rain ', 'in Spain');
select power(2,5);

When you use a FROM clause and specify a column name as a function argument, the function is applied for each item in the result set:

select concat('Country = ',country_code) from all_countries where population > 100000000;
select round(price) as dollar_value from product_catalog where price between 0.0 and 100.0;

Typically, if any argument to a built-in function is NULL, the result value is also NULL:

select cos(null);
select power(2,null);
select concat('a',null,'b');

Aggregate functions are a special category with different rules. These functions calculate a return value across all the items in a result set, so they require a FROM clause in the query:

select count(product_id) from product_catalog;
select max(height), avg(height) from census_data where age > 20;

Aggregate functions also ignore NULL values rather than returning a NULL result. For example, if some rows have NULL for a particular column, those rows are ignored when computing the AVG() for that column. Likewise, specifying COUNT(col_name) in a query counts only those rows where col_name contains a non-NULL value.

Mathematical Functions

Impala supports the following mathematical functions:

abs(double a)

Purpose: Returns the absolute value of the argument.

Return type: double

Usage notes: Use this function to ensure all return values are positive. This is different than the positive() function, which returns its argument unchanged (even if the argument was negative).

acos(double a)

Purpose: Returns the arccosine of the argument.

Return type: double

asin(double a)

Purpose: Returns the arcsine of the argument.

Return type: double

atan(double a)

Purpose: Returns the arctangent of the argument.

Return type: double

bin(bigint a)

Purpose: Returns the binary representation of an integer value, that is, a string of 0 and 1 digits.

Return type: string

ceil(double a), ceiling(double a)

Purpose: Returns the smallest integer that is greater than or equal to the argument.

Return type: int

conv(bigint num, int from_base, int to_base), conv(string num, int from_base, int to_base)

Purpose: Returns a string representation of an integer value in a particular base. The input value can be a string, for example to convert a hexadecimal number such as fce2 to decimal. To use the return value as a number (for example, when converting to base 10), use CAST() to convert to the appropriate type.

Return type: string

cos(double a)

Purpose: Returns the cosine of the argument.

Return type: double

degrees(double a)

Purpose: Converts argument value from radians to degrees.

Return type: double

e()

Purpose: Returns the mathematical constant e.

Return type: double

exp(double a)

Purpose: Returns the mathematical constant e raised to the power of the argument.

Return type: double

floor(double a)

Purpose: Returns the largest integer that is less than or equal to the argument.

Return type: int

fmod(double a, double b), fmod(float a, float b)

Purpose: Returns the modulus of a number.

Return type: float or double, depending on type of arguments

Added in: Impala 1.1.1

fnv_hash(type v),

Purpose: Returns a consistent 64-bit value derived from the input argument, for convenience of implementing hashing logic in an application.

Return type: BIGINT

Usage notes:

You might use the return value in an application where you perform load balancing, bucketing, or some other technique to divide processing or storage.

Because the result can be any 64-bit value, to restrict the value to a particular range, you can use an expression that includes the ABS() function and the % (modulo) operator. For example, to produce a hash value in the range 0-9, you could use the expression ABS(FNV_HASH(x)) % 10.

This function implements the same algorithm that Impala uses internally for hashing, on systems where the CRC32 instructions are not available.

This function implements the Fowler–Noll–Vo hash function, in particular the FNV-1a variation. This is not a perfect hash function: some combinations of values could produce the same result value. It is not suitable for cryptographic use.

Examples:

[localhost:21000] > create table h (x int, s string);
[localhost:21000] > insert into h values (0, 'hello'), (1,'world'), (1234567890,'antidisestablishmentarianism');
[localhost:21000] > select x, fnv_hash(x) from h;
+------------+----------------------+
| x          | fnv_hash(x)          |
+------------+----------------------+
| 0          | -2611523532599129963 |
| 1          | 4307505193096137732  |
| 1234567890 | 3614724209955230832  |
+------------+----------------------+
[localhost:21000] > select s, fnv_hash(s) from h;
+------------------------------+---------------------+
| s                            | fnv_hash(s)         |
+------------------------------+---------------------+
| hello                        | 6414202926103426347 |
| world                        | 6535280128821139475 |
| antidisestablishmentarianism | -209330013948433970 |
+------------------------------+---------------------+
[localhost:21000] > select s, abs(fnv_hash(s)) % 10 from h;
+------------------------------+-------------------------+
| s                            | abs(fnv_hash(s)) % 10.0 |
+------------------------------+-------------------------+
| hello                        | 8                       |
| world                        | 6                       |
| antidisestablishmentarianism | 4                       |
+------------------------------+-------------------------+

For short argument values, the high-order bits of the result have relatively low entropy:

[localhost:21000] > create table b (x boolean);
[localhost:21000] > insert into b values (true), (true), (false), (false);
[localhost:21000] > select x, fnv_hash(x) from b;
+-------+---------------------+
| x     | fnv_hash(x)         |
+-------+---------------------+
| true  | 2062020650953872396 |
| true  | 2062020650953872396 |
| false | 2062021750465500607 |
| false | 2062021750465500607 |
+-------+---------------------+

Added in: Impala 1.2.2

greatest(bigint a[, bigint b ...]), greatest(double a[, double b ...]), greatest(string a[, string b ...]), greatest(timestamp a[, timestamp b ...])

Purpose: Returns the largest value from a list of expressions.

Return type: same as the initial argument value, except that integer values are promoted to BIGINT and floating-point values are promoted to DOUBLE; use CAST() when inserting into a smaller numeric column

hex(bigint a), hex(string a)

Purpose: Returns the hexadecimal representation of an integer value, or of the characters in a string.

Return type: string

least(bigint a[, bigint b ...]), least(double a[, double b ...]), least(string a[, string b ...]), least(timestamp a[, timestamp b ...])

Purpose: Returns the smallest value from a list of expressions.

ln(double a)

Purpose: Returns the natural logarithm of the argument.

Return type: double

log(double base, double a)

Purpose: Returns the logarithm of the second argument to the specified base.

Return type: double

log10(double a)

Purpose: Returns the logarithm of the argument to the base 10.

Return type: double

log2(double a)

Purpose: Returns the logarithm of the argument to the base 2.

Return type: double

negative(int a), negative(double a)

Purpose: Returns the argument with the sign reversed; returns a positive value if the argument was already negative.

Return type: int or double, depending on type of argument

Usage notes: Use -abs(a) instead if you need to ensure all return values are negative.

pi()

Purpose: Returns the constant pi.

Return type: double

pmod(int a, int b), pmod(double a, double b)

Purpose: Returns the positive modulus of a number.

Return type: int or double, depending on type of arguments

positive(int a), positive(double a)

Purpose: Returns the original argument unchanged (even if the argument is negative).

Return type: int or double, depending on type of arguments

Usage notes: Use abs() instead if you need to ensure all return values are positive.

pow(double a, double p), power(double a, double p)

Purpose: Returns the first argument raised to the power of the second argument.

Return type: double

quotient(int numerator, int denominator)

Purpose: Returns the first argument divided by the second argument, discarding any fractional part. Avoids promoting arguments to DOUBLE as happens with the / SQL operator.

Return type: int

radians(double a)

Purpose: Converts argument value from degrees to radians.

Return type: double

rand(), rand(int seed)

Purpose: Returns a random value between 0 and 1. After rand() is called with a seed argument, it produces a consistent random sequence based on the seed value.

Return type: double

Usage notes: Currently, the random sequence is reset after each query, and multiple calls to rand() within the same query return the same value each time. For different number sequences that are different for each query, pass a unique seed value to each call to rand(). For example, select rand(unix_timestamp()) from ...

round(double a), round(double a, int d)

Purpose: Rounds a floating-point value. By default (with a single argument), rounds to the nearest integer. Values ending in .5 are rounded up for positive numbers, down for negative numbers (that is, away from zero). The optional second argument specifies how many digits to leave after the decimal point; values greater than zero produce a floating-point return value rounded to the requested number of digits to the right of the decimal point.

Return type: bigint for single argument; double when second argument greater than zero

sign(double a)

Purpose: Returns -1, 0, or 1 to indicate the signedness of the argument value.

Return type: int

sin(double a)

Purpose: Returns the sine of the argument.

Return type: double

sqrt(double a)

Purpose: Returns the square root of the argument.

Return type: double

tan(double a)

Purpose: Returns the tangent of the argument.

Return type: double

unhex(string a)

Purpose: Returns a string of characters with ASCII values corresponding to pairs of hexadecimal digits in the argument.

Return type: string

Type Conversion Functions

Impala supports the following type conversion functions:

cast(expr as type)

Conversion functions are usually used in combination with other functions, to explicitly pass the expected data types. Impala has strict rules regarding data types for function parameters. For example, Impala does not automatically convert a DOUBLE value to FLOAT, a BIGINT value to INT, or other conversion where precision could be lost or overflow could occur. Use CAST when passing a column value or literal to a function that expects a parameter with a different type. For example:

select concat('Here are the first ',10,' results.'); -- Fails
select concat('Here are the first ',cast(10 as string),' results.'); -- Succeeds

Date and Time Functions

The underlying Impala data type for date and time data is TIMESTAMP, which has both a date and a time portion. Functions that extract a single field, such as hour() or minute(), typically return an integer value. Functions that format the date portion, such as date_add() or to_date(), typically return a string value.

You can also adjust a TIMESTAMP value by adding or subtracting an INTERVAL expression. See TIMESTAMP Data Type for details. INTERVAL expressions are also allowed as the second argument for the date_add() and date_sub() functions, rather than integers.

Impala supports the following data and time functions:

adddate(timestamp startdate, int days, adddate(timestamp startdate, bigint days,

Purpose: Adds a specified number of days to a TIMESTAMP value. Similar to date_add(), but starts with an actual TIMESTAMP value instead of a string that is converted to a TIMESTAMP.

Return type: timestamp

current_timestamp()

Purpose: Alias for the now() function.

Return type: timestamp

date_add(timestamp startdate, int days), date_add(timestamp startdate, interval_expression)

Purpose: Adds a specified number of days to a TIMESTAMP value. The first argument can be a string, which is automatically cast to TIMESTAMP if it uses the recognized format, as described in TIMESTAMP Data Type. With an INTERVAL expression as the second argument, you can calculate a delta value using other units such as weeks, years, hours, seconds, and so on; see TIMESTAMP Data Type for details.

Return type: timestamp

date_sub(timestamp startdate, int days), date_sub(timestamp startdate, interval_expression)

Purpose: Subtracts a specified number of days from a TIMESTAMP value. The first argument can be a string, which is automatically cast to TIMESTAMP if it uses the recognized format, as described in TIMESTAMP Data Type. With an INTERVAL expression as the second argument, you can calculate a delta value using other units such as weeks, years, hours, seconds, and so on; see TIMESTAMP Data Type for details.

Return type: timestamp

datediff(string enddate, string startdate)

Purpose: Returns the number of days between two dates represented as strings.

Return type: int

day(string date), dayofmonth(string date)

Purpose: Returns the day field from a date represented as a string.

Return type: int

dayname(string date)

Purpose: Returns the day field from a date represented as a string, converted to the string corresponding to that day name. The range of return values is 'Sunday' to 'Saturday'. Used in report-generating queries, as an alternative to calling dayofweek() and turning that numeric return value into a string using a CASE expression.

Return type: string

dayofweek(string date)

Purpose: Returns the day field from a date represented as a string, corresponding to the day of the week. The range of return values is 1 (Sunday) to 7 (Saturday).

Return type: int

dayofyear(timestamp date)

Purpose: Returns the day field from a TIMESTAMP value, corresponding to the day of the year. The range of return values is 1 (January 1) to 366 (December 31 of a leap year).

Return type: int

days_add(timestamp startdate, int days), days_add(timestamp startdate, bigint days)

Purpose: Adds a specified number of days to a TIMESTAMP value. Similar to date_add(), but starts with an actual TIMESTAMP value instead of a string that is converted to a TIMESTAMP.

Return type: timestamp

days_sub(timestamp startdate, int days), days_sub(timestamp startdate, bigint days)

Purpose: Subtracts a specified number of days from a TIMESTAMP value. Similar to date_sub(), but starts with an actual TIMESTAMP value instead of a string that is converted to a TIMESTAMP.

Return type: timestamp

from_unixtime(bigint unixtime[, string format])

Purpose: Converts the number of seconds from the Unix epoch to the specified time into a string.

Return type: string

Usage notes: The format string accepts the variations allowed for the TIMESTAMP data type: date plus time, date by itself, time by itself, and optional fractional seconds for the time. See TIMESTAMP Data Type for details.

Currently, the format string is case-sensitive, especially to distinguish m for minutes and M for months. In Impala 1.3 and higher, you can switch the order of elements, use alternative separator characters, and use a different number of placeholders for each unit. Adding more instances of y, d, H, and so on produces output strings zero-padded to the requested number of characters. The exception is M for months, where M produces a non-padded value such as 3, MM produces a zero-padded value such as 03, MMM produces an abbreviated month name such as Mar, and sequences of 4 or more M are not allowed. A date string including all fields could be "yyyy-MM-dd HH:mm:ss.SSSSSS", "dd/MM/yyyy HH:mm:ss.SSSSSS", "MMM dd, yyyy HH.mm.ss (SSSSSS)" or other combinations of placeholders and separator characters.

Note: The more flexible format strings allowed with the built-in functions do not change the rules about using CAST() to convert from a string to a TIMESTAMP value. Strings being casted must still have the elements in the specified order and use the specified delimiter characters, as described in TIMESTAMP Data Type.

Examples:

[localhost:21000] > select from_unixtime(1392394861,"yyyy-MM-dd HH:mm:ss.SSSS");
+-------------------------------------------------------+
| from_unixtime(1392394861, 'yyyy-mm-dd hh:mm:ss.ssss') |
+-------------------------------------------------------+
| 2014-02-14 16:21:01.0000                              |
+-------------------------------------------------------+
[localhost:21000] > select from_unixtime(1392394861,"yyyy-MM-dd");
+-----------------------------------------+
| from_unixtime(1392394861, 'yyyy-mm-dd') |
+-----------------------------------------+
| 2014-02-14                              |
+-----------------------------------------+
[localhost:21000] > select from_unixtime(1392394861,"HH:mm:ss.SSSS");
+--------------------------------------------+
| from_unixtime(1392394861, 'hh:mm:ss.ssss') |
+--------------------------------------------+
| 16:21:01.0000                              |
+--------------------------------------------+
[localhost:21000] > select from_unixtime(1392394861,"HH:mm:ss");
+---------------------------------------+
| from_unixtime(1392394861, 'hh:mm:ss') |
+---------------------------------------+
| 16:21:01                              |
+---------------------------------------+

from_utc_timestamp(timestamp, string timezone)

Purpose: Converts a specified UTC timestamp value into the appropriate value for a specified time zone.

Return type: timestamp

hour(string date)

Purpose: Returns the hour field from a date represented as a string.

Return type: int

hours_add(timestamp date, int hours), hours_add(timestamp date, bigint hours)

Purpose: Returns the specified date and time plus some number of hours.

Return type: timestamp

hours_sub(timestamp date, int hours), hours_sub(timestamp date, bigint hours)

Purpose: Returns the specified date and time minus some number of hours.

Return type: timestamp

microseconds_add(timestamp date, int microseconds), microseconds_add(timestamp date, bigint microseconds)

Purpose: Returns the specified date and time plus some number of microseconds.

Return type: timestamp

microseconds_sub(timestamp date, int microseconds), microseconds_sub(timestamp date, bigint microseconds)

Purpose: Returns the specified date and time minus some number of microseconds.

Return type: timestamp

milliseconds_add(timestamp date, int milliseconds), milliseconds_add(timestamp date, bigint milliseconds)

Purpose: Returns the specified date and time plus some number of milliseconds.

Return type: timestamp

milliseconds_sub(timestamp date, int milliseconds), milliseconds_sub(timestamp date, bigint milliseconds)

Purpose: Returns the specified date and time minus some number of milliseconds.

Return type: timestamp

minute(string date)

Purpose: Returns the minute field from a date represented as a string.

Return type: int

minutes_add(timestamp date, int minutes), minutes_add(timestamp date, bigint minutes)

Purpose: Returns the specified date and time plus some number of minutes.

Return type: timestamp

minutes_sub(timestamp date, int minutes), minutes_sub(timestamp date, bigint minutes)

Purpose: Returns the specified date and time minus some number of minutes.

Return type: timestamp

month(string date)

Purpose: Returns the month field from a date represented as a string.

Return type: int

months_add(timestamp date, int months), months_add(timestamp date, bigint months)

Purpose: Returns the specified date and time plus some number of months.

Return type: timestamp

months_sub(timestamp date, int months), months_sub(timestamp date, bigint months)

Purpose: Returns the specified date and time minus some number of months.

Return type: timestamp

nanoseconds_add(timestamp date, int nanoseconds), nanoseconds_add(timestamp date, bigint nanoseconds)

Purpose: Returns the specified date and time plus some number of nanoseconds.

Return type: timestamp

nanoseconds_sub(timestamp date, int nanoseconds), nanoseconds_sub(timestamp date, bigint nanoseconds)

Purpose: Returns the specified date and time minus some number of nanoseconds.

Return type: timestamp

now()

Purpose: Returns the current date and time (in the UTC time zone) as a timestamp value.

Return type: timestamp

Usage notes: To find a date/time value in the future or the past relative to the current date and time, add or subtract an INTERVAL expression to the return value of now(). See TIMESTAMP Data Type for examples.

second(string date)

Purpose: Returns the second field from a date represented as a string.

Return type: int

seconds_add(timestamp date, int seconds), seconds_add(timestamp date, bigint seconds)

Purpose: Returns the specified date and time plus some number of seconds.

Return type: timestamp

seconds_sub(timestamp date, int seconds), seconds_sub(timestamp date, bigint seconds)

Purpose: Returns the specified date and time minus some number of seconds.

Return type: timestamp

subdate(timestamp startdate, int days, subdate(timestamp startdate, bigint days,

Return type: timestamp

to_date(timestamp)

Purpose: Returns a string representation of the date field from a timestamp value.

Return type: string

to_utc_timestamp(timestamp, string timezone)

Purpose: Converts a specified timestamp value in a specified time zone into the corresponding value for the UTC time zone.

Return type: timestamp

unix_timestamp(), unix_timestamp(string date), unix_timestamp(string date, string format)

Purpose: Returns a timestamp representing the current date and time, or converts from a specified date and time value represented as a string.

Return type: bigint

Usage notes: See from_unixtime() for details about the patterns you can use in the format string to represent the position of year, month, day, and so on in the date string. In Impala 1.3 and higher, you have more flexibility to switch the positions of elements and use different separator characters.

weekofyear(string date)

Purpose: Returns the corresponding week (1-53) from a date represented as a string.

Return type: int

weeks_add(timestamp date, int weeks), weeks_add(timestamp date, bigint weeks)

Purpose: Returns the specified date and time plus some number of weeks.

Return type: timestamp

weeks_sub(timestamp date, int weeks), weeks_sub(timestamp date, bigint weeks)

Purpose: Returns the specified date and time minus some number of weeks.

Return type: timestamp

year(string date)

Purpose: Returns the year field from a date represented as a string.

Return type: int

years_add(timestamp date, int years), years_add(timestamp date, bigint years)

Purpose: Returns the specified date and time plus some number of years.

Return type: timestamp

years_sub(timestamp date, int years), years_sub(timestamp date, bigint years)

Purpose: Returns the specified date and time minus some number of years.

Return type: timestamp

Conditional Functions

Impala supports the following conditional functions for testing equality, comparison operators, and nullity:

CASE a WHEN b THEN c [WHEN d THEN e]... [ELSE f] END

Purpose: Compares an expression to one or more possible values, and returns a corresponding result when a match is found.

CASE WHEN a THEN b [WHEN c THEN d]... [ELSE e] END

Purpose: Tests whether any of a sequence of expressions is true, and returns a corresponding result for the first true expression.

coalesce(type v1, type v2, ...)

Purpose: Returns the first specified argument that is not NULL, or NULL if all arguments are NULL.

if(boolean condition, type ifTrue, type ifFalseOrNull)

Purpose: Tests an expression and returns a corresponding result depending on whether the result is true, false, or NULL.

Return type: same as the ifTrue argument value

ifnull(type a, type ifNotNull)

Purpose: Alias for the isnull() function, with the same behavior. To simplify porting SQL with vendor extensions to Impala.

Added in: Impala 1.3.0

isnull(type a, type ifNotNull)

Purpose: Tests if an expression is NULL, and returns the expression result value if not. If the first argument is NULL, returns the second argument.

Compatibility notes: Equivalent to the nvl() function from Oracle Database or ifnull() from MySQL. The nvl() and ifnull() functions are also available in Impala.

Return type: same as the first argument value

nullif(expr1,expr2)

Purpose: Returns NULL if the two specified arguments are equal. If the specified arguments are not equal, returns the value of expr1. The data types of the expressions must be compatible, according to the conversion rules from Data Types. You cannot use an expression that evaluates to NULL for expr1; that way, you can distinguish a return value of NULL from an argument value of NULL, which would never match expr2.

Usage notes: This function is effectively shorthand for a CASE expression of the form:

CASE
  WHEN expr1 = expr2 THEN NULL
  ELSE expr1
END

It is commonly used in division expressions, to produce a NULL result instead of a divide-by-zero error when the divisor is equal to zero:

select 1.0 / nullif(c1,0) as reciprocal from t1;

You might also use it for compatibility with other database systems that support the same NULLIF() function.

Added in: Impala 1.3.0

nullifzero(numeric_expr)

Purpose: Returns NULL if the numeric expression evaluates to 0, otherwise returns the result of the expression.

Usage notes: Used to avoid error conditions such as divide-by-zero in numeric calculations. Serves as shorthand for a more elaborate CASE expression, to simplify porting SQL with vendor extensions to Impala.

Added in: Impala 1.3.0

nvl(type a, type ifNotNull)

Purpose: Alias for the isnull() function. Tests if an expression is NULL, and returns the expression result value if not. If the first argument is NULL, returns the second argument. Equivalent to the nvl() function from Oracle Database or ifnull() from MySQL.

Return type: same as the first argument value

Added in: Impala 1.1

zeroifnull(numeric_expr)

Purpose: Returns 0 if the numeric expression evaluates to NULL, otherwise returns the result of the expression.

Usage notes: Used to avoid unexpected results due to unexpected propagation of NULL values in numeric calculations. Serves as shorthand for a more elaborate CASE expression, to simplify porting SQL with vendor extensions to Impala.

Added in: Impala 1.3.0

String Functions

Impala supports the following string functions:

ascii(string str)

Purpose: Returns the numeric ASCII code of the first character of the argument.

Return type: int

char_length(string a), character_length(string a)

Purpose: Returns the length in characters of the argument string. Aliases for the length() function.

Return type: int

concat(string a, string b...)

Purpose: Returns a single string representing all the argument values joined together.

Return type: string

Usage notes: concat() and concat_ws() are appropriate for concatenating the values of multiple columns within the same row, while group_concat() joins together values from different rows.

concat_ws(string sep, string a, string b...)

Purpose: Returns a single string representing the second and following argument values joined together, delimited by a specified separator.

Return type: string

Usage notes: concat() and concat_ws() are appropriate for concatenating the values of multiple columns within the same row, while group_concat() joins together values from different rows.

find_in_set(string str, string strList)

Purpose: Returns the position (starting from 1) of the first occurrence of a specified string within a comma-separated string. Returns NULL if either argument is NULL, 0 if the search string is not found, or 0 if the search string contains a comma.

Return type: int

group_concat(string s [, string sep])

Purpose: Returns a single string representing the argument value concatenated together for each row of the result set. If the optional separator string is specified, the separator is added between each pair of concatenated values.

Return type: string

Usage notes: concat() and concat_ws() are appropriate for concatenating the values of multiple columns within the same row, while group_concat() joins together values from different rows.

By default, returns a single string covering the whole result set. To include other columns or values in the result set, or to produce multiple concatenated strings for subsets of rows, include a GROUP BY clause in the query.

initcap(string str)

Purpose: Returns the input string with the first letter capitalized.

Return type: string

instr(string str, string substr)

Purpose: Returns the position (starting from 1) of the first occurrence of a substring within a longer string.

Return type: int

length(string a)

Purpose: Returns the length in characters of the argument string.

Return type: int

locate(string substr, string str[, int pos])

Purpose: Returns the position (starting from 1) of the first occurrence of a substring within a longer string, optionally after a particular position.

Return type: int

lower(string a), lcase(string a)

Purpose: Returns the argument string converted to all-lowercase.

Return type: string

lpad(string str, int len, string pad)

Purpose: Returns a string of a specified length, based on the first argument string. If the specified string is too short, it is padded on the left with a repeating sequence of the characters from the pad string. If the specified string is too long, it is truncated on the right.

Return type: string

ltrim(string a)

Purpose: Returns the argument string with any leading spaces removed from the left side.

Return type: string

parse_url(string urlString, string partToExtract [, string keyToExtract])

Purpose: Returns the portion of a URL corresponding to a specified part. The part argument can be 'PROTOCOL', 'HOST', 'PATH', 'REF', 'AUTHORITY', 'FILE', 'USERINFO', or 'QUERY'. Uppercase is required for these literal values. When requesting the QUERY portion of the URL, you can optionally specify a key to retrieve just the associated value from the key-value pairs in the query string.

Return type: string

Usage notes: This function is important for the traditional Hadoop use case of interpreting web logs. For example, if the web traffic data features raw URLs not divided into separate table columns, you can count visitors to a particular page by extracting the 'PATH' or 'FILE' field, or analyze search terms by extracting the corresponding key from the 'QUERY' field.

regexp_extract(string subject, string pattern, int index)

Purpose: Returns the specified () group from a string based on a regular expression pattern. Group 0 refers to the entire extracted string, while group 1, 2, and so on refers to the first, second, and so on (...) portion.

Return type: string

The Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used by the Boost library. For details, see the Boost documentation. It has most idioms familiar from regular expressions in Perl, Python, and so on. It does not support .*? for non-greedy matches.

Because the impala-shell interpreter uses the \ character for escaping, use \\ to represent the regular expression escape character in any regular expressions that you submit through impala-shell. You might prefer to use the equivalent character class names, such as [[:digit:]] instead of \d which you would have to escape as \\d.

Examples:

This example shows how group 0 matches the full pattern string, including the portion outside any () group:

[localhost:21000] > select regexp_extract('abcdef123ghi456jkl','.*(\\d+)',0);
+-----------------------------------------------------+
| regexp_extract('abcdef123ghi456jkl', '.*(\\d+)', 0) |
+-----------------------------------------------------+
| abcdef123ghi456                                     |
+-----------------------------------------------------+
Returned 1 row(s) in 0.11s

This example shows how group 1 matches just the contents inside the first () group in the pattern string:

[localhost:21000] > select regexp_extract('abcdef123ghi456jkl','.*(\\d+)',1);
+-----------------------------------------------------+
| regexp_extract('abcdef123ghi456jkl', '.*(\\d+)', 1) |
+-----------------------------------------------------+
| 456                                                 |
+-----------------------------------------------------+
Returned 1 row(s) in 0.11s

The Boost regular expression syntax does not support the .*? idiom for non-greedy matches. This example shows how a pattern string starting with .* matches the longest possible portion of the source string, effectively serving as a greedy match and returning the rightmost set of lowercase letters. A pattern string both starting and ending with .* finds two potential matches of equal length, and returns the first one found (the leftmost set of lowercase letters), effectively serving as a non-greedy match.

[localhost:21000] > select regexp_extract('AbcdBCdefGHI','.*([[:lower:]]+)',1);
+-------------------------------------------------------+
| regexp_extract('abcdbcdefghi', '.*([[:lower:]]+)', 1) |
+-------------------------------------------------------+
| def                                                   |
+-------------------------------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select regexp_extract('AbcdBCdefGHI','.*([[:lower:]]+).*',1);
+---------------------------------------------------------+
| regexp_extract('abcdbcdefghi', '.*([[:lower:]]+).*', 1) |
+---------------------------------------------------------+
| bcd                                                     |
+---------------------------------------------------------+
Returned 1 row(s) in 0.11s

regexp_replace(string initial, string pattern, string replacement)

Purpose: Returns the initial argument with the regular expression pattern replaced by the final argument string.

Return type: string

Examples:

These examples show how you can replace parts of a string matching a pattern with replacement text, which can include backreferences to any () groups in the pattern string. The backreference numbers start at 1, and any \ characters must be escaped as \\.

Replace a character pattern with new text:

[localhost:21000] > select regexp_replace('aaabbbaaa','b+','xyz');
+------------------------------------------+
| regexp_replace('aaabbbaaa', 'b+', 'xyz') |
+------------------------------------------+
| aaaxyzaaa                                |
+------------------------------------------+
Returned 1 row(s) in 0.11s

Replace a character pattern with substitution text that includes the original matching text:

[localhost:21000] > select regexp_replace('aaabbbaaa','(b+)','<\\1>');
+----------------------------------------------+
| regexp_replace('aaabbbaaa', '(b+)', '<\\1>') |
+----------------------------------------------+
| aaa<bbb>aaa                                  |
+----------------------------------------------+
Returned 1 row(s) in 0.11s

Remove all characters that are not digits:

[localhost:21000] > select regexp_replace('123-456-789','[^[:digit:]]','');
+---------------------------------------------------+
| regexp_replace('123-456-789', '[^[:digit:]]', '') |
+---------------------------------------------------+
| 123456789                                         |
+---------------------------------------------------+
Returned 1 row(s) in 0.12s

repeat(string str, int n)

Purpose: Returns the argument string repeated a specified number of times.

Return type: string

reverse(string a)

Purpose: Returns the argument string with characters in reversed order.

Return type: string

rpad(string str, int len, string pad)

Purpose: Returns a string of a specified length, based on the first argument string. If the specified string is too short, it is padded on the right with a repeating sequence of the characters from the pad string. If the specified string is too long, it is truncated on the right.

Return type: string

rtrim(string a)

Purpose: Returns the argument string with any trailing spaces removed from the right side.

Return type: string

space(int n)

Purpose: Returns a concatenated string of the specified number of spaces. Shorthand for repeat(' ',n).

Return type: string

strleft(string a, int num_chars)

Purpose: Returns the leftmost characters of the string. Shorthand for a call to substr() with 2 arguments.

Return type: string

strright(string a, int num_chars)

Purpose: Returns the rightmost characters of the string. Shorthand for a call to substr() with 2 arguments.

Return type: string

substr(string a, int start [, int len]), substring(string a, int start [, int len])

Purpose: Returns the portion of the string starting at a specified point, optionally with a specified maximum length. The characters in the string are indexed starting at 1.

Return type: string

translate(string input, string from, string to)

Purpose: Returns the input string with a set of characters replaced by another set of characters.

Return type: string

trim(string a)

Purpose: Returns the input string with both leading and trailing spaces removed. The same as passing the string through both ltrim() and rtrim().

Return type: string

upper(string a), ucase(string a)

Purpose: Returns the argument string converted to all-uppercase.

Return type: string

Miscellaneous Functions

Impala supports the following utility functions that do not operate on a particular column or data type:

current_database(): Purpose: Returns the database that the session is currently using, either default if no database has been selected, or whatever database the session switched to through a USE statement or the impalad -d option.
Return type: string
pid(): Purpose: Returns the process ID of the impalad daemon that the session is connected to. You can use it during low-level debugging, to issue Linux commands that trace, show the arguments, and so on the impalad process.
Return type: int
user(): Purpose: Returns the username of the Linux user who is connected to the impalad daemon. Typically called a single time, in a query without any FROM clause, to understand how authorization settings apply in a security context; once you know the logged-in user name, you can check which groups that user belongs to, and from the list of groups you can check which roles are available to those groups through the authorization policy file.
Return type: string
version(): Purpose: Returns information such as the precise version number and build date for the impalad daemon that you are currently connected to. Typically used to confirm that you are connected to the expected level of Impala to use a particular feature, or to connect to several nodes and confirm they are all running the same level of impalad.
Return type: string (with one or more embedded newlines)

Page generated September 3, 2015.