Introduction to Hadoop Security
In versions of Apache Hadoop prior to CDH3 Beta 3, user authentication and group resolution were performed on the client machines accessing a Hadoop cluster. There was no attempt to verify the identity or group membership of users interacting with the Hadoop Distributed File System (HDFS) or MapReduce. Although HDFS has had file and directory permissions since version 0.16, without strong authentication guarantees those permissions were useful only for preventing accidental data loss: malicious users could easily impersonate other users, making the permissions unenforceable. Furthermore, even if users could be authenticated to HDFS, all map tasks necessarily ran under a single shared user account, which allowed users to access each other's resources.
The security features in CDH4 enable Hadoop to prevent malicious user impersonation. The Hadoop daemons use Kerberos to authenticate users on all remote procedure calls (RPCs). Group resolution is performed on the Hadoop master nodes (NameNode, JobTracker, and ResourceManager), guaranteeing that group membership cannot be manipulated by users. Map tasks run under the account of the user who submitted the job, isolating jobs from one another. In addition to these features, new authorization mechanisms in HDFS and MapReduce give administrators finer control over user access to data.
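As a minimal sketch, Kerberos authentication is switched on through properties in core-site.xml. The property names below come from Hadoop's security configuration; the full procedure, including creating Kerberos principals and keytab files for each daemon, is covered elsewhere in the security documentation:

```xml
<!-- core-site.xml (sketch only; a real deployment also needs
     per-daemon Kerberos principals and keytab files) -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value> <!-- default is "simple", i.e. no authentication -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>     <!-- enable service-level authorization checks -->
</property>
```

With `hadoop.security.authentication` set to `kerberos`, every RPC to the NameNode, JobTracker, or ResourceManager is authenticated with the caller's Kerberos credentials rather than a client-asserted username.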
The security features in CDH4 meet the needs of most Hadoop customers because clusters are typically accessible only to trusted personnel. In particular, Hadoop's current threat model assumes that users cannot:
- Have root access to cluster machines.
- Have root access to shared client machines.
- Read or modify packets on the network of the cluster.
CDH4.1.0 and later releases support encryption of all user data sent over the network. For configuration instructions, see Configuring Encrypted Shuffle, Encrypted Web UIs, and Encrypted HDFS Transport.
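As an illustrative sketch of what those configuration guides describe, the relevant switches look roughly like this (SSL keystore and truststore setup is also required and is omitted here):

```xml
<!-- Sketch only; see the configuration guides named above for the
     full procedure, including keystore/truststore setup. -->

<!-- core-site.xml: enables the encrypted shuffle and HTTPS Web UIs -->
<property>
  <name>hadoop.ssl.enabled</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml: enables encrypted HDFS data transfer -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```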
CDH4.0.x, however, does not support user data encryption: RPC traffic can be encrypted on the wire, but actual user data (HDFS block transfers and the MapReduce shuffle) is not. For most current Hadoop users, this lack of data encryption in CDH4.0.x is acceptable because of the assumptions stated above. However, if you need data encryption, you can upgrade to CDH4.1.x or later.
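The RPC wire encryption mentioned above is controlled by the `hadoop.rpc.protection` property; a sketch, assuming Kerberos (SASL) authentication is already enabled:

```xml
<!-- core-site.xml (sketch only; takes effect only when
     hadoop.security.authentication is set to "kerberos") -->
<property>
  <name>hadoop.rpc.protection</name>
  <!-- one of: authentication | integrity | privacy -->
  <value>privacy</value> <!-- "privacy" encrypts RPC payloads -->
</property>
```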
Note also that there is no built-in support for on-disk encryption in any of the CDH4.x releases.