APPX_COUNT_DISTINCT Query Option (CDH 5.2 or higher only)

Allows multiple COUNT(DISTINCT) operations within a single query, by internally rewriting each COUNT(DISTINCT) to use the NDV() function. The resulting count is approximate rather than precise.

Type: Boolean; recognized values are 1 and 0, or true and false; any other value interpreted as false

Default: false (shown as 0 in output of SET statement)

Examples:

The following examples show how the APPX_COUNT_DISTINCT lets you work around the restriction where a query can only evaluate COUNT(DISTINCT col_name) for a single column. By default, you can count the distinct values of one column or another, but not both in a single query:

[localhost:21000] > select count(distinct x) from int_t;
+-------------------+
| count(distinct x) |
+-------------------+
| 10                |
+-------------------+
[localhost:21000] > select count(distinct property) from int_t;
+--------------------------+
| count(distinct property) |
+--------------------------+
| 7                        |
+--------------------------+
[localhost:21000] > select count(distinct x), count(distinct property) from int_t;
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters
as count(DISTINCT x); deviating function: count(DISTINCT property)

When you enable the APPX_COUNT_DISTINCT query option, now the query with multiple COUNT(DISTINCT) works. The reason this behavior requires a query option is that each COUNT(DISTINCT) is rewritten internally to use the NDV() function instead, which provides an approximate result rather than a precise count.

[localhost:21000] > set APPX_COUNT_DISTINCT=true;
[localhost:21000] > select count(distinct x), count(distinct property) from int_t;
+-------------------+--------------------------+
| count(distinct x) | count(distinct property) |
+-------------------+--------------------------+
| 10                | 7                        |
+-------------------+--------------------------+

Related information:

COUNT Function, DISTINCT Operator, NDV Function