This is the documentation for Cloudera Impala 1.2.4.

Enabling Sentry Authorization for Impala

Authorization determines which users are allowed to access which resources, and what operations they are allowed to perform. In Impala 1.1 and higher, you use the Sentry open source project for authorization. Sentry adds a fine-grained authorization framework for Hadoop. By default (when authorization is not enabled), Impala does all read and write operations with the privileges of the impala user, which is suitable for a development/test environment but not for a secure production environment. When authorization is enabled, Impala uses the OS user ID of the user who runs impala-shell or other client program, and associates various privileges with each user.

Privileges can be granted on different objects in the schema. Any privilege that can be granted is associated with a level in the object hierarchy. If a privilege is granted on a container object in the hierarchy, the child object automatically inherits it. This is the same privilege model as Hive and other database systems such as MySQL.

The object hierarchy covers Server, URI, Database, and Table. Currently, you cannot assign privileges at the partition or column level.

A restricted set of privileges determines what you can do with each object:

SELECT privilege
Lets you read data from a table, for example with the SELECT statement, the INSERT...SELECT syntax, or CREATE TABLE...LIKE. Also required to issue the DESCRIBE statement or the EXPLAIN statement for a query against a particular table. Only objects for which a user has this privilege are shown in the output for SHOW DATABASES and SHOW TABLES statements. The REFRESH statement and INVALIDATE METADATA statements only access metadata for tables for which the user has this privilege.
INSERT privilege
Lets you write data to a table. Applies to the INSERT and LOAD DATA statements.
ALL privilege
Lets you create or modify the object. Required to run DDL statements such as CREATE TABLE, ALTER TABLE, or DROP TABLE for a table, CREATE DATABASE or DROP DATABASE for a database, or CREATE VIEW, ALTER VIEW, or DROP VIEW for a view. Also required for the URI of the "location" parameter for the CREATE EXTERNAL TABLE and LOAD DATA statements.

Privileges can be specified for a table before that table actually exists. If you do not have sufficient privilege to perform an operation, the error message does not disclose if the object exists or not.

Privileges are encoded in a policy file, stored in HDFS. (Currently, there is no GRANT or REVOKE statement; you cannot adjust privileges through SQL statements.) The location is listed in the auth-site.xml configuration file. To minimize overhead, the security information from this file is cached by each impalad daemon and refreshed automatically, with a default interval of 5 minutes. After making a substantial change to security policies, restart all Impala daemons to pick up the changes immediately.

See the following sections for details about using the Impala authorization features:

Setting Up the Policy File for Impala Security

You put the policy file in a designated location in HDFS. The impalad daemon reads it at startup when you specify the -server_name and -authorization_policy_file startup options. The file controls which objects (databases, tables, and HDFS directory paths) can be accessed by the user who connects to impalad, and what operations that user can perform on those objects.

The policy file uses the familiar .ini format, divided into the major sections [groups] and [roles]. There is also an optional [databases] section, which allows you to specify a specific policy file for a particular database, as explained in Using Multiple Policy Files for Different Databases. Another optional section, [users], allows you to override the OS-level mapping of users to groups; that is an advanced technique primarily for testing and debugging, and is beyond the scope of this document.

In the [groups] section, you define various categories of users and select which roles are associated with each category.

The group and user names in the [groups] section correspond to Linux groups and users on the server where the impalad daemon runs. When you access Impala through the impala-shell interpreter, for purposes of authorization, the user is the logged-in Linux user and the groups are the Linux groups that user is a member of. When you access Impala through the ODBC or JDBC interfaces, the user and password specified through the connection string are used as login credentials for the Linux server, and authorization is based on that user name and the associated Linux group membership.

In the [roles] section, you define a set of roles. For each role, you specify precisely which privileges are available: that is, which objects users with that role can access, and what operations they can perform on those objects. This is the lowest-level category of security information; the other sections in the policy file map the privileges to higher-level divisions of groups and users. In the [groups] section, you specify which roles are associated with which groups. The privileges are specified using patterns like:
server=server_name->db=database_name->table=table_name->action=SELECT
server=server_name->db=database_name->table=table_name->action=CREATE
server=server_name->db=database_name->table=table_name->action=ALL
For the server_name value, substitute the same symbolic name you specify with the impalad -server_name option. You can use * wildcard characters at each level of the privilege specification to allow access to all such objects. For example:
server=impala-host.example.com->db=default->table=t1->action=SELECT
server=impala-host.example.com->db=*->table=*->action=CREATE
server=impala-host.example.com->db=*->table=audit_log->action=SELECT
server=impala-host.example.com->db=default->table=t1->action=*
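To make the structure of these privilege strings concrete, the following Python sketch splits a rule into its levels. The function name parse_privilege is hypothetical, and this is not how Sentry itself reads the policy file; it only mirrors the server->db->table->action layout shown above.

```python
def parse_privilege(rule):
    """Split 'server=s->db=d->table=t->action=a' into ordered (key, value) pairs.

    Illustrative helper only; not part of Impala or Sentry.
    """
    parts = []
    for level in rule.strip().split("->"):
        key, sep, value = level.partition("=")
        if not sep or not key.strip() or not value.strip():
            raise ValueError("malformed level: %r" % level)
        parts.append((key.strip().lower(), value.strip()))
    return parts


if __name__ == "__main__":
    rule = "server=impala-host.example.com->db=*->table=audit_log->action=SELECT"
    for key, value in parse_privilege(rule):
        print(key, "=", value)
```

Each level is a key=value pair, and a * value at any level stands for all objects at that level.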

When authorization is enabled, Impala uses the policy file as a whitelist, representing every privilege available to any user on any object. That is, only operations specified for the appropriate combination of object, role, group, and user are allowed; all other operations are not allowed. If a group or role is defined multiple times in the policy file, the last definition takes precedence.

To understand the notion of whitelisting, set up a minimal policy file that does not provide any privileges for any object. When you connect to an Impala node where this policy file is in effect, you get no results for SHOW DATABASES, and an error when you issue any SHOW TABLES, USE database_name, DESCRIBE table_name, SELECT, or other statements that expect to access databases or tables, even if the corresponding databases and tables exist.

The contents of the policy file are cached, to avoid a performance penalty for each query. The policy file is re-checked by each impalad node every 5 minutes. When you make a non-time-sensitive change such as adding new privileges or new users, you can let the change take effect automatically a few minutes later. If you remove or reduce privileges, and want the change to take effect immediately, restart the impalad daemon on all nodes, again specifying the -server_name and -authorization_policy_file options so that the rules from the updated policy file are applied.

Secure Startup for the impalad Daemon

To run the impalad daemon with authorization enabled, you add two options to the IMPALA_SERVER_ARGS declaration in the /etc/default/impala configuration file. The -authorization_policy_file option specifies the HDFS path to the policy file that defines the privileges on different schema objects. The rules in the policy file refer to a symbolic server name, and you specify a matching name as the argument to the -server_name option of impalad.

For example, you might adapt your /etc/default/impala configuration to contain lines like the following:

IMPALA_SERVER_ARGS=" \
-authorization_policy_file=/user/hive/warehouse/auth-policy.ini \
-server_name=server1 \
...

Then, the rules in the [roles] section of the policy file would refer to this same server1 name. For example, the following rule sets up a role report_generator that lets users with that role query any table in a database named reporting_db on a node where the impalad daemon was started up with the -server_name=server1 option:

[roles]
report_generator = server=server1->db=reporting_db->table=*->action=SELECT

If the impalad daemon is not already running, start it as described in Starting Impala. If it is already running, restart it with the command sudo /etc/init.d/impala-server restart. Run the appropriate commands on all the nodes where impalad normally runs.

When impalad is started with one or both of the -server_name=server1 and -authorization_policy_file options, Impala authorization is enabled. If Impala detects any errors or inconsistencies in the authorization settings or the policy file, the daemon refuses to start.

Examples of Policy File Rules for Security Scenarios

The following examples show rules that might go in the policy file to deal with various authorization-related scenarios. For illustration purposes, this section shows several very small policy files with only a few rules each. In your environment, typically you would define many roles to cover all the scenarios involving your own databases, tables, and applications, and a smaller number of groups, whose members are given the privileges from one or more roles.

A User with No Privileges

If a user has no privileges at all, that user cannot access any schema objects in the system. The error messages do not disclose the names or existence of objects that the user is not authorized to read.

This is the experience you want a user to have if they somehow log into a system where they are not an authorized Impala user. In a real deployment with a filled-in policy file, a user might have no privileges because they are not a member of any of the relevant groups mentioned in the policy file.

Examples of Privileges for Administrative Users

When an administrative user has broad access to tables or databases, the associated rules in the [roles] section typically use wildcards and/or inheritance. For example, in the following sample policy file, db=* refers to all databases and db=*->table=* refers to all tables in all databases.

Omitting the rightmost portion of a rule means that the privileges apply to all the objects that could be specified there. For example, in the following sample policy file, the all_databases role has all privileges for all tables in all databases, while the one_database role has all privileges for all tables in one specific database. The all_databases role does not grant privileges on URIs, so a group with that role could not issue a CREATE TABLE statement with a LOCATION clause. The entire_server role has all privileges on both databases and URIs within the server.

[groups]
supergroup = all_databases

[roles]
read_all_tables = server=server1->db=*->table=*->action=SELECT
all_tables = server=server1->db=*->table=*
all_databases = server=server1->db=*
one_database = server=server1->db=test_db
entire_server = server=server1
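
The inheritance behavior described above can be modeled with a short Python sketch. This is a simplified illustration under stated assumptions, not Sentry's actual matching logic: the function name implies is hypothetical, only a bare * is treated as a wildcard, and statements that accept "any privilege" (such as USE) are not modeled.

```python
def _levels(rule):
    # Turn 'server=server1->db=*' into [('server', 'server1'), ('db', '*')].
    return [tuple(level.split("=", 1)) for level in rule.split("->")]


def implies(granted, requested):
    """Return True if the granted rule covers the requested privilege.

    Simplified model: a granted rule that stops early (rightmost levels
    omitted) covers everything underneath the last level it names.
    """
    g = _levels(granted)
    r = _levels(requested)
    if len(g) > len(r):
        return False  # the grant is more specific than the request
    for (g_key, g_value), (r_key, r_value) in zip(g, r):
        if g_key != r_key:
            return False  # for example, a db-level grant does not cover a URI
        if g_value != "*" and g_value != r_value:
            return False
    return True


if __name__ == "__main__":
    uri_request = "server=server1->uri=hdfs://namenode:8020/data->action=SELECT"
    print(implies("server=server1->db=*", uri_request))  # all_databases role
    print(implies("server=server1", uri_request))        # entire_server role
```

In this model, the all_databases rule does not cover the URI request while the entire_server rule does, which matches the distinction drawn between those two roles.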

A User with Privileges for Specific Databases and Tables

If a user has privileges for specific tables in specific databases, the user can access those objects but nothing else. They can see the tables and their parent databases in the output of SHOW TABLES and SHOW DATABASES, USE the appropriate databases, and perform the relevant actions (SELECT and/or INSERT) based on the table privileges. To actually create a table requires the ALL privilege at the database level, so you might define separate roles for the user that sets up a schema and other users or applications that perform day-to-day operations on the tables.

The following sample policy file shows some of the syntax that is appropriate as the policy file grows, such as the # comment syntax, \ continuation syntax, and comma separation for roles assigned to groups or privileges assigned to roles.

[groups]
cloudera = training_sysadmin, instructor
visitor = student

[roles]
training_sysadmin = server=server1->db=training, \
server=server1->db=instructor_private, \
server=server1->db=lesson_development
instructor = server=server1->db=training->table=*->action=*, \
server=server1->db=instructor_private->table=*->action=*, \
server=server1->db=lesson_development->table=lesson*
# This particular course is all about queries, so the students can SELECT but not INSERT or CREATE/DROP.
student = server=server1->db=training->table=lesson_*->action=SELECT
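
As a rough sketch of how the # comment, \ continuation, and comma-separation conventions fit together, the following Python code reads policy-file text into nested dictionaries. The read_policy function and the SAMPLE text are illustrative assumptions; this is a minimal reader for experimenting with the syntax, not the parser that Impala or Sentry uses.

```python
SAMPLE = r"""
[groups]
cloudera = training_sysadmin, instructor
visitor = student

[roles]
training_sysadmin = server=server1->db=training, \
server=server1->db=instructor_private
# Students can only query.
student = server=server1->db=training->table=lesson_*->action=SELECT
"""


def read_policy(text):
    """Return {section: {name: [items]}} for policy-file text (simplified)."""
    sections = {}
    current = None
    pending = ""  # text accumulated from lines ending in a backslash
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.endswith("\\"):
            pending += line[:-1].strip() + " "
            continue
        line = pending + line
        pending = ""
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1].strip(), {})
            continue
        if current is None:
            raise ValueError("entry outside of a section: %r" % line)
        name, _, value = line.partition("=")
        current[name.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return sections


if __name__ == "__main__":
    policy = read_policy(SAMPLE)
    for role, privileges in policy["roles"].items():
        print(role, "->", privileges)
```

Splitting only on the first = keeps the key=value levels inside each privilege string intact.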

Privileges for Working with External Data Files

When data is being inserted through the LOAD DATA statement, or is referenced from an HDFS location outside the normal Impala database directories, the user also needs appropriate permissions on the URIs corresponding to those HDFS locations.

In this sample policy file:

  • The external_table role lets us insert into and query the Impala table, external_table.sample.
  • The staging_dir role lets us specify the HDFS path /user/cloudera/external_data with the LOAD DATA statement. Remember, when Impala queries or loads data files, it operates on all the files in that directory, not just a single file, so any Impala LOCATION parameters refer to a directory rather than an individual file.
  • We included the IP address and port of the Hadoop name node in the HDFS URI of the staging_dir rule. We found those details in /etc/hadoop/conf/core-site.xml, under the fs.default.name element. That is what we use in any roles that specify URIs (that is, the locations of directories in HDFS).
  • We start this example after the table external_table.sample is already created. In the policy file for the example, we have already taken away the external_table_admin role from the cloudera group, and replaced it with the lesser-privileged external_table role.
  • We assign privileges to a subdirectory underneath /user/cloudera in HDFS, because such privileges also apply to any subdirectories underneath. If we had assigned privileges to the parent directory /user/cloudera, it would be too likely to mess up other files by specifying a wrong location by mistake.
  • The cloudera under the [groups] section refers to the cloudera group. (In the demo VM used for this example, there is a cloudera user that is a member of a cloudera group.)

Policy file:

[groups]
cloudera = external_table, staging_dir

[roles]
external_table_admin = server=server1->db=external_table
external_table = server=server1->db=external_table->table=sample->action=*
staging_dir = server=server1->uri=hdfs://127.0.0.1:8020/user/cloudera/external_data->action=*

impala-shell session:

[localhost:21000] > use external_table;
Query: use external_table
[localhost:21000] > show tables;
Query: show tables
Query finished, fetching results ...
+--------+
| name   |
+--------+
| sample |
+--------+
Returned 1 row(s) in 0.02s

[localhost:21000] > select * from sample;
Query: select * from sample
Query finished, fetching results ...
+-----+
| x   |
+-----+
| 1   |
| 5   |
| 150 |
+-----+
Returned 3 row(s) in 1.04s

[localhost:21000] > load data inpath '/user/cloudera/external_data' into table sample;
Query: load data inpath '/user/cloudera/external_data' into table sample
Query finished, fetching results ...
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 2 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.26s
[localhost:21000] > select * from sample;
Query: select * from sample
Query finished, fetching results ...
+-------+
| x     |
+-------+
| 2     |
| 4     |
| 6     |
| 8     |
| 64738 |
| 49152 |
| 1     |
| 5     |
| 150   |
+-------+
Returned 9 row(s) in 0.22s

[localhost:21000] > load data inpath '/user/cloudera/unauthorized_data' into table sample;
Query: load data inpath '/user/cloudera/unauthorized_data' into table sample
ERROR: AuthorizationException: User 'cloudera' does not have privileges to access: hdfs://127.0.0.1:8020/user/cloudera/unauthorized_data

Controlling Access at the Column Level through Views

If a user has SELECT privilege for a view, they can query the view, even if they do not have any privileges on the underlying table. To see the details about the underlying table through EXPLAIN or DESCRIBE FORMATTED statements on the view, the user must also have SELECT privilege for the underlying table.

  Important:

The types of data that are considered sensitive and confidential differ depending on the jurisdiction, the type of industry, or both. For fine-grained access controls, set up appropriate privileges based on all applicable laws and regulations.

Be careful using the ALTER VIEW statement to point an existing view at a different base table or a new set of columns that includes sensitive or restricted data. Make sure that any users who have SELECT privilege on the view do not gain access to any additional information they are not authorized to see.

The following example shows how a system administrator could set up a table containing some columns with sensitive information, then create a view that only exposes the non-confidential columns.

[localhost:21000] > create table sensitive_info
                > (
                >   name string,
                >   address string,
                >   credit_card string,
                >   taxpayer_id string
                > );
[localhost:21000] > create view name_address_view as select name, address from sensitive_info;

Then the following policy file specifies read-only privilege for that view, without authorizing access to the underlying table:

[groups]
cloudera = view_only_privs

[roles]
view_only_privs = server=server1->db=reports->table=name_address_view->action=SELECT

Thus, a user with the view_only_privs role could access through Impala queries the basic information but not the sensitive information, even if both kinds of information were part of the same data file.

You might define other views to allow users from different groups to query different sets of columns.

The DEFAULT Database in a Secure Deployment

Because of the extra emphasis on granular access controls in a secure deployment, you should move any important or sensitive information out of the DEFAULT database into a named database whose privileges are specified in the policy file. Sometimes you might need to give privileges on the DEFAULT database for administrative reasons; for example, as a place you can reliably specify with a USE statement when preparing to drop a database.

Separating Administrator Responsibility from Read and Write Privileges

Remember that creating a database requires full privilege on that database, while day-to-day operations on tables within that database can be performed with lower levels of privilege on specific tables. Thus, you might set up separate roles for each database or application: an administrative one that can create or drop the database, and a user-level one that can access only the relevant tables.

For example, this policy file divides responsibilities between users in 3 different groups:

  • Members of the supergroup group have the training_sysadmin role and so can set up a database named training.
  • Members of the cloudera group have the instructor role and so can create, insert into, and query any tables in the training database, but cannot create or drop the database itself.
  • Members of the visitor group have the student role and so can query those tables in the training database.

[groups]
supergroup = training_sysadmin
cloudera = instructor
visitor = student

[roles]
training_sysadmin = server=server1->db=training
instructor = server=server1->db=training->table=*->action=*
student = server=server1->db=training->table=*->action=SELECT
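
The way group membership resolves to privileges in this policy file can be sketched in Python as follows. The GROUPS and ROLES dictionaries simply restate the sample policy file above, and privileges_for is a hypothetical helper; the real lookup is performed inside Impala and Sentry.

```python
# The [groups] and [roles] sections of the sample policy file, as dictionaries.
GROUPS = {
    "supergroup": ["training_sysadmin"],
    "cloudera": ["instructor"],
    "visitor": ["student"],
}
ROLES = {
    "training_sysadmin": ["server=server1->db=training"],
    "instructor": ["server=server1->db=training->table=*->action=*"],
    "student": ["server=server1->db=training->table=*->action=SELECT"],
}


def privileges_for(user_groups):
    """Collect the privilege strings reachable from a user's Linux groups."""
    privileges = []
    for group in user_groups:
        for role in GROUPS.get(group, []):
            for privilege in ROLES.get(role, []):
                if privilege not in privileges:
                    privileges.append(privilege)
    return privileges


if __name__ == "__main__":
    print(privileges_for(["visitor"]))
    print(privileges_for(["some_other_group"]))  # no matching group: empty list
```

A user whose groups are not mentioned in the policy file ends up with an empty list, which corresponds to the whitelisting behavior: no listed privilege means no access.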

Setting Up Schema Objects for a Secure Impala Deployment

Remember that in the [roles] section of the policy file, you specify privileges at the level of individual databases and tables, or all databases or all tables within a database. To simplify the structure of these rules, plan ahead of time how to name your schema objects so that data with different authorization requirements is divided into separate databases.

If you are adding security on top of an existing Impala deployment, remember that you can rename tables or even move them between databases using the ALTER TABLE statement. In Impala, creating new databases is a relatively inexpensive operation, basically just creating a new directory in HDFS.

You can also plan the security scheme and set up the policy file before the actual schema objects named in the policy file exist. Because the authorization capability is based on whitelisting, a user can only create a new database or table if the required privilege is already in the policy file: either by listing the exact name of the object being created, or a * wildcard to match all the applicable objects within the appropriate container.

Privilege Model and Object Hierarchy

Privileges can be granted on different objects in the schema. Any privilege that can be granted is associated with a level in the object hierarchy. If a privilege is granted on a container object in the hierarchy, the child object automatically inherits it. This is the same privilege model as Hive and other database systems such as MySQL.

The kinds of objects in the schema hierarchy are:

Server
  URI
  Database
    Table

The server name is specified by the -server_name option when impalad starts. Specify the same name for all impalad nodes in the cluster.

URIs represent the HDFS paths you specify as part of statements such as CREATE EXTERNAL TABLE and LOAD DATA. Typically, you specify what look like UNIX paths, but these locations can also be prefixed with hdfs:// to make clear that they are really URIs. To set privileges for a URI, specify the name of a directory, and the privilege applies to all the files in that directory and any directories underneath it.

There are not separate privileges for individual table partitions or columns. To specify read privileges at this level, you create a view that queries specific columns and/or partitions from a base table, and give SELECT privilege on the view but not the underlying table. See Views for details about views in Impala.

URIs must start with either hdfs:// or file://. If a URI starts with anything else, it causes an exception and the policy file is invalid. When defining URIs for HDFS, you must also specify the NameNode. For example:

data_read = server=server1->uri=file:///path/to/dir, \
server=server1->uri=hdfs://namenode:port/path/to/dir

  Warning:

Because the NameNode host and port must be specified, Cloudera strongly recommends you use High Availability (HA). This ensures that the URI remains constant even if the NameNode changes.

data_read = server=server1->uri=file:///path/to/dir, \
server=server1->uri=hdfs://ha-nn-uri/path/to/dir

Table 1. Valid Privilege types and objects they apply to

Privilege | Object
----------|-----------------
INSERT    | TABLE, URI
SELECT    | TABLE, VIEW, URI
ALL       | SERVER, DB, URI

  Note:

Although this document refers to the ALL privilege, currently, you do not use the actual keyword ALL in the policy file. When you code role entries in the policy file:

  • To specify the ALL privilege for a server, use a role like server=server_name.
  • To specify the ALL privilege for a database, use a role like server=server_name->db=database_name.
  • To specify the ALL privilege for a table, use a role like server=server_name->db=database_name->table=table_name->action=*.

Table 2. Privilege table for Impala SQL operations

Impala SQL Operation  | Privileges Required | Object on which Privileges Required
----------------------|---------------------|------------------------------------
EXPLAIN               | SELECT              | Table
LOAD DATA             | INSERT, SELECT      | Table (INSERT) and URI (ALL); write privilege is required on the URI because the data files are physically moved from there
CREATE DATABASE       | ALL                 | Database
DROP DATABASE         | ALL                 | Database
DROP TABLE            | ALL                 | Table
DESCRIBE TABLE        | SELECT or INSERT    | Table
ALTER TABLE           | ALL                 | Table
SHOW DATABASES        | Any privilege       | Any object in the database; only databases for which the user has privileges on some object are shown
SHOW TABLES           | Any privilege       | Any table in the database; only tables for which the user has privileges are shown
CREATE VIEW           | ALL, SELECT         | You need ALL privilege on the named view, plus SELECT privilege for any tables or views referenced by the view query. Once the view is created or altered by a high-privileged system administrator, it can be queried by a lower-privileged user who does not have full query privileges for the base tables. (This is how you implement column-level security.)
DROP VIEW             | ALL                 | Table
ALTER VIEW            | ALL, SELECT         | You need ALL privilege on the named view, plus SELECT privilege for any tables or views referenced by the view query. Once the view is created or altered by a high-privileged system administrator, it can be queried by a lower-privileged user who does not have full query privileges for the base tables. (This is how you implement column-level security.)
ALTER TABLE LOCATION  | ALL                 | Table, URI
CREATE TABLE          | ALL                 | Database
CREATE EXTERNAL TABLE | ALL, SELECT         | Database (ALL), URI (SELECT)
SELECT                | SELECT              | Table, View; you can have SELECT privilege for a view without having SELECT privilege for the underlying tables, which allows a system administrator to implement column-level security by creating views that reference particular sets of columns
USE                   | Any privilege       | Any object in the database
CREATE FUNCTION       | ALL                 | Server
DROP FUNCTION         | ALL                 | Server
REFRESH               | ALL                 | Table
INVALIDATE METADATA   | ALL                 | Server
COMPUTE STATS         | ALL                 | Table

Using Multiple Policy Files for Different Databases

For an Impala cluster with many databases being accessed by many users and applications, it might be cumbersome to update the security policy file for each privilege change or each new database, table, or view. You can allow security to be managed separately for individual databases, by setting up a separate policy file for each database:

  • Add the optional [databases] section to the main policy file.
  • Add entries in the [databases] section for each database that has its own policy file.
  • For each listed database, specify the HDFS path of the appropriate policy file.

For example:

[databases]
# Defines the location of the per-DB policy files for the 'customers' and 'sales' databases.
customers = hdfs://ha-nn-uri/etc/access/customers.ini
sales = hdfs://ha-nn-uri/etc/access/sales.ini
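
For illustration, the following Python sketch looks up which per-database policy file, if any, is listed for a given database in a [databases] section like the one above. The per_db_policy_path function is a hypothetical helper built on the standard configparser module, and the sketch only handles a main policy file that contains simple name = value entries; it does not describe how Impala itself resolves or combines the files.

```python
import configparser

MAIN_POLICY = """
[databases]
# Defines the location of the per-DB policy files.
customers = hdfs://ha-nn-uri/etc/access/customers.ini
sales = hdfs://ha-nn-uri/etc/access/sales.ini
"""


def per_db_policy_path(db_name, main_policy_text):
    """Return the policy file path listed for db_name, or None if not listed."""
    parser = configparser.ConfigParser(
        interpolation=None,      # treat values as literal text
        delimiters=("=",),       # so ':' in hdfs:// is never a delimiter
        comment_prefixes=("#",),
    )
    parser.read_string(main_policy_text)
    if parser.has_option("databases", db_name):
        return parser.get("databases", db_name)
    return None


if __name__ == "__main__":
    print(per_db_policy_path("sales", MAIN_POLICY))
    print(per_db_policy_path("marketing", MAIN_POLICY))
```

A database without an entry returns None here, meaning no separate policy file is listed for it.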

Debugging Failed Sentry Authorization Requests

Sentry logs all facts that lead up to authorization decisions at the debug level. If you do not understand why Sentry is denying access, the best way to debug is to temporarily turn on debug logging:
  • In Cloudera Manager, add log4j.logger.org.apache.sentry=DEBUG to the logging settings for your service through the corresponding Logging Safety Valve field for the Impala, Hive Server 2, or Solr Server services.
  • On systems not managed by Cloudera Manager, add log4j.logger.org.apache.sentry=DEBUG to the log4j.properties file on each host in the cluster, in the appropriate configuration directory for each service.
Specifically, look for exceptions and messages such as:
FilePermission server..., RequestPermission server...., result [true|false]
which indicate each evaluation Sentry makes. The FilePermission is from the policy file, while RequestPermission is the privilege required for the query. A RequestPermission will iterate over all appropriate FilePermission settings until a match is found. If no matching privilege is found, Sentry returns false indicating "Access Denied".

Configuring Per-User Access for Hue

When users connect to Impala directly through the impala-shell interpreter, the Impala authorization feature determines what actions they can take and what data they can see. When users submit Impala queries through a separate application, such as Hue, typically all requests are treated as coming from the same user. In Impala 1.2 and higher, authorization is extended by a new feature that allows applications to pass along credentials for the users that connect to them, and issue Impala queries with the privileges for those users. This feature is known as "impersonation". Currently, the impersonation feature is available only for Impala queries submitted through the Hue interface; for example, Impala cannot impersonate the HDFS user.

Impala 1.2 adds a new startup option for impalad, --authorized_proxy_user_config. When you specify this option, users whose names you specify (such as hue) can impersonate another user. The name of the user whose privileges are used is passed using the HiveServer2 configuration property impala.doas.user.

You can specify a list of users that the application user can impersonate, or * to allow a superuser to impersonate any other user. For example:

impalad --authorized_proxy_user_config 'hue=user1,user2;admin=*' ...
  Note: Make sure to use single quotes or escape characters to ensure that any * characters do not undergo wildcard expansion when specified in command-line arguments.
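
The format of the option value can be illustrated with a short Python sketch that parses a string such as 'hue=user1,user2;admin=*' and checks whether one user may impersonate another. The function names are hypothetical, and this is only a model of the format described above, not the check that impalad performs.

```python
def parse_proxy_config(config):
    """Parse 'hue=user1,user2;admin=*' into {proxy_user: set of allowed users}."""
    mapping = {}
    for entry in config.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        proxy, sep, users = entry.partition("=")
        if not sep or not proxy.strip():
            raise ValueError("malformed entry: %r" % entry)
        mapping[proxy.strip()] = {u.strip() for u in users.split(",") if u.strip()}
    return mapping


def may_impersonate(mapping, proxy_user, target_user):
    """Return True if proxy_user is listed with target_user or with '*'."""
    allowed = mapping.get(proxy_user, set())
    return "*" in allowed or target_user in allowed


if __name__ == "__main__":
    mapping = parse_proxy_config("hue=user1,user2;admin=*")
    print(may_impersonate(mapping, "hue", "user1"))
    print(may_impersonate(mapping, "hue", "user3"))
    print(may_impersonate(mapping, "admin", "user3"))
```

Entries are separated by semicolons, and each entry maps one application user to a comma-separated list of users, or to * for any user.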

See Modifying Impala Startup Options for details about adding or changing impalad startup options. See this Cloudera blog post for background information about the impersonation capability in HiveServer2.