Overview of Data Management Mechanisms for an Enterprise Data Hub

For the data in the cluster, it is critical to understand where the data is coming from and how it's being used. The goal of auditing is to capture a complete and immutable record of all activity within a system. Auditing plays a central role in three key activities within the enterprise:
  • First, auditing is part of a system’s security regime and can explain what happened, when, and to whom or what in case of a breach or other malicious activity. For example, if a rogue administrator deletes a user’s data set, auditing provides the details of this action, and the correct data can be retrieved from backup.

  • The second activity is compliance: auditing helps satisfy the core requirements of regulations associated with sensitive data or personally identifiable information (PII), such as the Health Insurance Portability and Accountability Act (HIPAA) or the Payment Card Industry (PCI) Data Security Standard. Auditing provides the touchpoints necessary to construct the trail of who, how, when, and how often data is produced, viewed, and manipulated.

  • Lastly, auditing provides the historical data and context for data forensics. Audit information leads to the understanding of how various populations use different data sets and can help establish the access patterns of these data sets. This examination, such as trend analysis, is broader in scope than compliance and can assist content and system owners in their data optimization efforts.

The challenge for auditing is the reliable, timely, and tamper-proof capture of all activity, including administrative actions. Until recently, the native Hadoop ecosystem relied primarily on log files for this purpose. Log files are unacceptable for most enterprise audit use cases: real-time monitoring is impossible, and log mechanics can be unreliable. For example, a system crash before or during a write commit can compromise integrity and lead to data loss.

Cloudera Navigator is a fully integrated data management and security tool for the Hadoop platform. Data management and security capabilities are critical for enterprise customers that operate in highly regulated industries and have stringent compliance requirements. This topic provides only an overview of some of the auditing and metadata management capabilities that Cloudera Navigator offers. For complete details, see Cloudera Data Management.

Cloudera Navigator

The following sections describe some of the capabilities that Cloudera Navigator provides for auditing, metadata management, and lineage.

Auditing

While Hadoop has historically lacked centralized cross-component audit capabilities, products such as Cloudera Navigator add secured, real-time audit components to key data and access frameworks. Cloudera Navigator allows administrators to configure, collect, and view audit events, to understand who accessed what data and how. Cloudera Navigator also allows administrators to generate reports that list the HDFS access permissions granted to groups. Cloudera Navigator tracks access permissions and actual accesses to all entities in HDFS, Hive, HBase, Impala, Sentry, and Solr, and to the Cloudera Navigator Metadata Server itself. This helps answer questions such as who has access to which entities, which entities a user accessed, when an entity was accessed and by whom, which entities were accessed using a service, and which device was used for access. Cloudera Navigator auditing supports tracking access to:
  • HDFS entities accessed by HDFS, Hive, HBase, Impala, and Solr services
  • HBase and Impala
  • Hive metadata
  • Sentry
  • Solr
  • Cloudera Navigator Metadata Server
Data collected from these services also provides visibility into usage patterns for users, lets you see point-in-time permissions and how they have changed (leveraging Sentry), and lets you review and verify HDFS permissions. Cloudera Navigator also provides out-of-the-box integration with leading enterprise metadata, lineage, and SIEM applications. For details on how Cloudera Navigator handles auditing, see Cloudera Navigator Auditing Architecture.



Metadata Management

For metadata and data discovery, Cloudera Navigator features complete metadata storage. First, it consolidates the technical metadata for all data inside Hadoop into a single, searchable interface and allows for automatic tagging of data based on the external sources entering the cluster. For example, if there is an external ETL process, data can be automatically tagged as such when it enters Hadoop. Second, it supports user-based tagging to augment files, tables, and individual columns with custom business context, tags, and key/value pairs. Combined, these capabilities allow data to be easily discovered, classified, and located, supporting not only governance and compliance but also user discovery within Hadoop.

Cloudera Navigator also includes metadata policy management that can trigger actions (such as the autoclassification of metadata) for specific datasets based on arrival or scheduled intervals. This allows users to easily set, monitor, and enforce data management policies, while also integrating with common third-party tools.

For details on how Cloudera Navigator handles metadata, see Cloudera Navigator Metadata Architecture.

Lineage

Cloudera Navigator automatically collects upstream and downstream data lineage and makes it easy to visualize, so that you can verify the reliability of data. For each data source, it shows, down to the column level, the precise upstream data sources, the transforms performed to produce it, and the impact that data has on downstream artifacts. Cloudera Navigator supports tracking the lineage of HDFS files, datasets, and directories, Hive tables and columns, MapReduce and YARN jobs, Hive queries, Impala queries, Pig scripts, Oozie workflows, Spark jobs, and Sqoop jobs. For details, see Cloudera Navigator Lineage Diagrams.
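
To make the idea of upstream and downstream lineage concrete, the following minimal sketch models lineage as a directed graph and walks it in both directions. It is an illustration only, not how Cloudera Navigator stores or computes lineage; the entity names and the EDGES list are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (source -> derived artifact).
# These names are invented for illustration and do not come from any real cluster.
EDGES = [
    ("hdfs:/raw/orders", "hive:staging.orders"),
    ("hive:staging.orders", "hive:mart.daily_revenue"),
    ("hive:staging.customers", "hive:mart.daily_revenue"),
    ("hive:mart.daily_revenue", "hdfs:/export/revenue.csv"),
]


def build_graph(edges):
    """Return (upstream, downstream) adjacency maps for the given edges."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return upstream, downstream


def reachable(graph, start):
    """Breadth-first traversal: every node reachable from start, excluding start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


upstream, downstream = build_graph(EDGES)

# All direct and indirect sources that feed the revenue table.
sources = reachable(upstream, "hive:mart.daily_revenue")

# Everything that would be affected by a change to the staging orders table.
impact = reachable(downstream, "hive:staging.orders")
```

Walking the upstream map answers "where did this data come from", and walking the downstream map answers "what is affected if this data changes", which are the two questions the lineage view addresses.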



Integration within an EDH

The monitoring and reporting of Hadoop systems, while critical to enterprise usage, are only a part of an enterprise’s complete audit infrastructure and data policy. Often these enterprise tools and policies require that all audit information route through a central interface to aid comprehensive reporting, and Hadoop-specific audit data can be integrated with these existing enterprise SIEM applications and other tools. For example, Cloudera Navigator exposes Hadoop audit data through several delivery methods:
  • Using syslog, which lets Cloudera Navigator act as a mediator between the raw event streams in Hadoop and the SIEM tools.
  • Using a REST API for custom enterprise tools.
  • Exporting the data to a file, such as a comma-delimited text file.
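
As an illustration of the file-export path, the sketch below reads a comma-delimited audit export and answers a few typical audit questions. The column names (timestamp, service, username, operation, resource, allowed) and the sample rows are assumptions made for this example; check the header of an actual export before relying on them.

```python
import csv
import io
from collections import Counter

# Hypothetical export contents. Column names and values are invented for
# illustration and may not match a real Cloudera Navigator export.
SAMPLE_EXPORT = """timestamp,service,username,operation,resource,allowed
2015-06-01T10:00:00Z,hdfs,alice,open,/data/sales/q1.csv,true
2015-06-01T10:05:00Z,hive,bob,QUERY,default.customers,false
2015-06-01T10:07:00Z,hdfs,alice,delete,/data/sales/tmp,true
2015-06-01T10:09:00Z,impala,bob,QUERY,default.orders,true
"""


def load_events(text):
    """Parse comma-delimited text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def denied(events):
    """Events that were rejected, for example due to lack of privileges."""
    return [e for e in events if e["allowed"].lower() == "false"]


def resources_by_user(events, username):
    """Sorted, de-duplicated resources that a given user touched."""
    return sorted({e["resource"] for e in events if e["username"] == username})


def events_per_service(events):
    """Number of audit events recorded for each service."""
    return Counter(e["service"] for e in events)


events = load_events(SAMPLE_EXPORT)
denied_events = denied(events)
alice_resources = resources_by_user(events, "alice")
service_counts = events_per_service(events)
```

To process a real export, the same functions can be fed the contents of the exported file instead of SAMPLE_EXPORT, after adjusting the column names to match that file.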

Auditing in Hadoop Projects

This section describes the CDH and managed service versions supported by the Cloudera Navigator auditing and metadata features.

Cloudera Navigator Auditing

This section describes the audited operations and service versions supported by Cloudera Navigator auditing.
The following entries list, for each component, the audited operations (for details, see Cloudera Navigator Auditing) and the minimum supported service version.

HDFS (minimum supported service version: CDH 4.0.0)
  • Operations that access or modify a file's or directory's data or metadata
  • Operations denied due to lack of privileges

HBase (minimum supported service version: CDH 4.0.0)
  • In CDH versions earlier than 4.2.0, for grant and revoke operations, the operation in log events is ADMIN
  • In simple authentication mode, if the HBase Secure RPC Engine property is false (the default), the username in log events is UNKNOWN. To see a meaningful username:
    1. Click the HBase service.
    2. Click the Configuration tab.
    3. Select Service-wide > Security.
    4. Set the HBase Secure RPC Engine property to true.
    5. Save the change and restart the service.

Hive (minimum supported service version: CDH 4.2.0; CDH 4.4.0 for operations denied due to lack of privileges)
  • Operations (except grant, revoke, and metadata access only) sent to HiveServer2
  • Operations denied due to lack of privileges
Limitations:
  • Actions taken against Hive using the Hive CLI are not audited. Therefore, if you have enabled auditing, you should disable the Hive CLI to prevent actions against Hive that are not audited.
  • In simple authentication mode, the username in log events is the username passed in the HiveServer2 connect command. If you do not pass a username in the connect command, the username in log events is anonymous.

Hue
  • Operations (except grant, revoke, and metadata access only) sent through the Beeswax Server (minimum supported service version: CDH 4.4.0)
  • User operations such as log in, log out, add and remove user, add and remove LDAP group, add and remove user from LDAP group (minimum supported service version: CDH 5.5.0)

Impala (minimum supported service version: Impala 1.2.1 with CDH 4.4.0)
  • Queries denied due to lack of privileges
  • Queries that pass analysis

Navigator Metadata Server (minimum supported service version: Cloudera Navigator 2.3)
  • Viewing and changing audit reports
  • Viewing and changing authorization configurations
  • Viewing and changing metadata
  • Viewing and changing policies
  • Viewing and changing saved searches

Sentry (minimum supported service version: CDH 5.1.0)
  • Operations sent to the HiveServer2 and Hive Metastore Server roles and Impala service
  • Adding and deleting roles, assigning roles to groups and removing roles from groups, creating and deleting privileges, granting and revoking privileges
  • Operations denied due to lack of privileges
You do not directly configure the Sentry service for auditing. Instead, when you configure the Hive and Impala services for auditing, grant, revoke, and metadata operations appear in the Hive or Impala service audit logs.

Solr (minimum supported service version: CDH 5.4.0)
  • Index creation and deletion
  • Schema and configuration file modification
  • Index, service, document tag access

Cloudera Navigator Metadata

This section describes the CDH and managed service versions supported by the Cloudera Navigator metadata feature.
Each entry lists a component followed by its minimum supported version:
  • HDFS: CDH 4.4.0. However, federated HDFS is not supported.
  • Hive: CDH 4.4.0
  • Impala: CDH 5.4.0
  • MapReduce: CDH 4.4.0
  • Oozie: CDH 4.4.0. Supported actions:
    • 2.4 - map-reduce, pig, hive, hive2, sqoop
    • 2.3 and lower - map-reduce, pig, hive, sqoop
  • Pig: CDH 4.6.0
  • Spark: CDH 5.4.0
  • Sqoop 1: CDH 4.4.0. All Cloudera connectors are supported.
  • YARN: CDH 5.0.0
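
The minimum versions above can be checked programmatically. The following sketch, which is not part of any Cloudera tool, copies the CDH versions listed above into a dictionary and reports which components meet the minimum for a given CDH release. The function names are hypothetical, and the sketch does not model the federated HDFS restriction or the Oozie action lists.

```python
# Minimum CDH versions for the metadata feature, copied from the list above.
MIN_CDH_VERSION = {
    "HDFS": "4.4.0",
    "Hive": "4.4.0",
    "Impala": "5.4.0",
    "MapReduce": "4.4.0",
    "Oozie": "4.4.0",
    "Pig": "4.6.0",
    "Spark": "5.4.0",
    "Sqoop 1": "4.4.0",
    "YARN": "5.0.0",
}


def parse_version(text):
    """Turn a dotted version such as '4.10.0' into a tuple of ints: (4, 10, 0)."""
    return tuple(int(part) for part in text.split("."))


def supported_components(cdh_version):
    """Sorted names of components whose minimum version is met by cdh_version."""
    current = parse_version(cdh_version)
    return sorted(
        name
        for name, minimum in MIN_CDH_VERSION.items()
        if current >= parse_version(minimum)
    )


on_cdh_4_6 = supported_components("4.6.0")
on_cdh_5_0 = supported_components("5.0.0")
```

Versions are compared as integer tuples rather than as strings so that a release such as 4.10.0 sorts after 4.4.0.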