Security Overview for an Enterprise Data Hub

Any system managing data in production today must meet security requirements imposed by government and industry regulations, and by the general public, whose information may be housed in such systems. As a system designed to support ever-increasing amounts and types of data, Hadoop core and ecosystem components are constantly being re-evaluated and enhanced to meet ever-evolving security requirements, with the goal of thwarting any attack against that data.

This overview gives a high-level introduction, starting with broad security requirements for information and the technology processes and capabilities aimed at meeting them. It then takes a closer look at the four broad information security goals and outlines how specific Hadoop components can meet them. Some sections also include architectural considerations for setup and integration.

Security Requirements

Security encompasses broad business and operational goals that can be met by various technologies and processes:

  • Perimeter Security focuses on guarding access to the cluster, its data, and its various services. In information security, Authentication can help ensure that only validated users and processes are allowed entry to the cluster, node, or other protected target.
  • Data Protection means preventing any unauthorized access to data, at rest and in transit. In information security, this translates to Encryption.
  • Entitlement includes defining privileges for users, applications, and processes, and enforcing what users and applications can do with data. In information security, this translates to Authorization.
  • Transparency refers to monitoring and reporting on the where, when, and how of data usage. Notions of transparency or visibility may be subsumed by the broader concept of data governance. In information security, this translates to Auditing.

The Hadoop ecosystem covers a wide range of applications, datastores, and computing frameworks, and each of these components manifests these operational capabilities differently.

Securing a Hadoop Cluster in Stages

Given the complexity of security and the wide range of possible cluster configurations, Cloudera recommends configuring the security capabilities only after successfully setting up a cluster without any security. Taking a phased approach to securing the cluster helps ensure that the cluster can meet the given target level.
  • Level 0: A non-secure cluster, that is, a fully functioning cluster with no security configured. Never use such a system in a production environment: it is vulnerable to any and all attacks and exploits.
  • Level 1: A minimally secure cluster. First, set up authentication so that users and services cannot access the cluster until they prove their identities. Next, configure simple authorization mechanisms that let you assign privileges to users and user groups. Set up auditing procedures to keep track of who accesses the cluster (and how). Authentication, authorization, and auditing are minimal security measures only. Production systems would still require well-trained cluster administrators and effective security procedures, certified by an expert.
  • Level 2: For more robust security, encrypt all sensitive data (minimally) or encrypt all cluster data. Use key-management systems to handle encryption keys. In addition to encryption, set up auditing on data in metastores. Regularly review and update the system metadata. Ideally, set up your cluster so that you can trace the lineage of any data object and meet any goals that may fall under the rubric of data governance.
  • Level 3: Most secure, the secure enterprise data hub (EDH). Encrypt all data on the cluster, both at rest and in transit. Use a fault-tolerant key management system. Auditing mechanisms must comply with industry, government, and regulatory standards (PCI, HIPAA, and NIST, for example). The compliance requirement extends beyond the EDH that stores the data to any system that integrates with it.

    Leveraging all four levels of security, Cloudera’s EDH platform can pass technology reviews for most common compliance regulations.
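
The staged approach above can be read as a cumulative checklist, where each level also requires everything in the levels below it. The following Python sketch is one possible interpretation under that assumption. The `ClusterSecurity` fields and the `security_level` function are hypothetical names invented for illustration, not part of any Cloudera or Hadoop API, and the criteria per level are a simplification of the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSecurity:
    """Hypothetical summary of the capabilities enabled on a cluster."""
    authentication: bool = False
    authorization: bool = False
    auditing: bool = False
    sensitive_data_encrypted: bool = False
    key_management: bool = False
    metastore_auditing: bool = False
    all_data_encrypted_at_rest: bool = False
    all_data_encrypted_in_transit: bool = False
    fault_tolerant_key_management: bool = False
    compliant_auditing: bool = False


def security_level(c: ClusterSecurity) -> int:
    """Return the highest level (0-3) whose checklist is fully satisfied.

    Assumption of this sketch: levels are cumulative, so Level 2 requires
    Level 1, and Level 3 requires Level 2.
    """
    level1 = c.authentication and c.authorization and c.auditing

    # Encrypting all data at rest also covers the sensitive subset, and a
    # fault-tolerant key management system still counts as key management.
    encrypted = c.sensitive_data_encrypted or c.all_data_encrypted_at_rest
    has_kms = c.key_management or c.fault_tolerant_key_management
    level2 = level1 and encrypted and has_kms and c.metastore_auditing

    level3 = (
        level2
        and c.all_data_encrypted_at_rest
        and c.all_data_encrypted_in_transit
        and c.fault_tolerant_key_management
        and c.compliant_auditing
    )

    if level3:
        return 3
    if level2:
        return 2
    if level1:
        return 1
    return 0


if __name__ == "__main__":
    minimal = ClusterSecurity(authentication=True, authorization=True, auditing=True)
    print(security_level(minimal))
```

Under this reading, a cluster that has authentication and authorization but no auditing is still reported as Level 0, because Level 1 asks for all three.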

Hadoop Security Architecture

What follows depicts the Hadoop ecosystem in detail, in particular the interactions between the different Cloudera Enterprise, security, and user management components. It also shows how a production environment with a couple of datacenters and assorted users and data feeds, both internal and external, must receive and authenticate many insecure connections.
  • As illustrated, external data streams can be authenticated by mechanisms in place for Flume and Kafka. Any data from legacy databases is ingested using Sqoop. Users such as data scientists and analysts can interact directly with the cluster using interfaces such as Hue or Cloudera Manager. Alternatively, they could be using a service like Impala for creating and submitting jobs for data analysis. All of these interactions can be protected by an Active Directory Kerberos deployment.
  • Encryption can be applied to data at rest using transparent HDFS encryption with an enterprise-grade Key Trustee Server. Cloudera also recommends using Navigator Encrypt to protect data on a cluster associated with the Cloudera Manager, Cloudera Navigator, Hive, and HBase metastores, and any log files or spills.
  • Authorization policies can be enforced using Sentry (for services such as Hive, Impala and Search) as well as HDFS Access Control Lists.
  • Auditing capabilities can be provided by using Cloudera Navigator.
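
Some of the capabilities listed above surface as ordinary Hadoop configuration properties. The Python sketch below parses Hadoop-style configuration XML and reports which of a few standard Apache Hadoop properties are missing or set to an unexpected value. The choice of properties, the `EXPECTED` table, and the `read_properties` and `find_gaps` helpers are assumptions made for illustration. On a managed cluster these settings are typically applied through Cloudera Manager rather than edited by hand, and Sentry, Navigator, and Key Trustee Server have their own configuration that this sketch does not cover.

```python
import xml.etree.ElementTree as ET

# Expected values for a few standard Apache Hadoop properties. Which
# properties to check is an assumption of this sketch.
EXPECTED = {
    "hadoop.security.authentication": "kerberos",  # instead of the default "simple"
    "hadoop.security.authorization": "true",       # service-level authorization
    "dfs.namenode.acls.enabled": "true",           # HDFS Access Control Lists
    "dfs.encrypt.data.transfer": "true",           # encrypt DataNode block transfers
}


def read_properties(xml_text: str) -> dict:
    """Parse Hadoop-style <configuration> XML into a name -> value dict."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def find_gaps(props: dict) -> list:
    """Return names of expected properties that are missing or set differently.

    A missing property is reported as a gap even though Hadoop would fall
    back to its built-in default, which may differ between versions.
    """
    return [
        name
        for name, want in EXPECTED.items()
        if props.get(name, "").lower() != want
    ]


# Hypothetical sample input combining core-site.xml and hdfs-site.xml entries.
SAMPLE = """
<configuration>
  <property><name>hadoop.security.authentication</name><value>kerberos</value></property>
  <property><name>hadoop.security.authorization</name><value>true</value></property>
  <property><name>dfs.namenode.acls.enabled</name><value>false</value></property>
</configuration>
"""

if __name__ == "__main__":
    for name in find_gaps(read_properties(SAMPLE)):
        print("needs attention:", name)
```

With the sample input, the check flags `dfs.namenode.acls.enabled` (explicitly disabled) and `dfs.encrypt.data.transfer` (not set), while the two authentication and authorization properties pass.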