This is the documentation for CDH 4.7.0.
Documentation for other versions is available at Cloudera Documentation.

Appendix A – Troubleshooting

Problem 1: Running any Hadoop command fails after enabling security.

Description:

A user must have a valid Kerberos ticket in order to interact with a secure Hadoop cluster. Running any Hadoop command (such as hadoop fs -ls) will fail if you do not have a valid Kerberos ticket in your credentials cache. If you do not have a valid ticket, you will receive an error such as:

11/01/04 12:08:12 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

Solution:

You can examine the Kerberos tickets currently in your credentials cache by running the klist command. You can obtain a ticket by running the kinit command and either specifying a keytab file containing credentials, or entering the password for your principal.

Problem 2: Java is unable to read the Kerberos credentials cache created by versions of MIT Kerberos 1.8.1 or higher.

Description:

If you are running MIT Kerberos 1.8.1 or higher, the following error will occur when you attempt to interact with the Hadoop cluster, even after successfully obtaining a Kerberos ticket using kinit:

11/01/04 12:08:12 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

Because of a change 1 in the format in which MIT Kerberos writes its credentials cache, there is a bug 2 in the Oracle JDK 6 Update 26 and earlier that causes Java to be unable to read the Kerberos credentials cache created by versions of MIT Kerberos 1.8.1 or higher. Kerberos 1.8.1 is the default in Ubuntu Lucid and later releases and Debian Squeeze and later releases. (On RHEL and CentOS, an older version of MIT Kerberos which does not have this issue, is the default.)

Footnotes: 1 MIT Kerberos change: http://krbdev.mit.edu/rt/Ticket/Display.html?id=6206 2 Report of bug in Oracle JDK 6 Update 26 and earlier: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6979329

Solution:

If you encounter this problem, you can work around it by running kinit -R after running kinit initially to obtain credentials. Doing so will cause the ticket to be renewed, and the credentials cache rewritten in a format which Java can read. To illustrate this:

$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_1000)
$ hadoop fs -ls
11/01/04 13:15:51 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
$ kinit
Password for atm@YOUR-REALM.COM: 
$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: atm@YOUR-REALM.COM

Valid starting     Expires            Service principal
01/04/11 13:19:31  01/04/11 23:19:31  krbtgt/YOUR-REALM.COM@YOUR-REALM.COM
	renew until 01/05/11 13:19:30
$ hadoop fs -ls
11/01/04 13:15:59 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
$ kinit -R
$ hadoop fs -ls
Found 6 items
drwx------   - atm atm          0 2011-01-02 16:16 /user/atm/.staging
  Note:
This workaround for Problem 2 requires the initial ticket to be renewable. Note that whether or not you can obtain renewable tickets is dependent upon a KDC-wide setting, as well as a per-principal setting for both the principal in question and the Ticket Granting Ticket (TGT) service principal for the realm. A non-renewable ticket will have the same values for its "valid starting" and "renew until" times. If the initial ticket is not renewable, the following error message is displayed when attempting to renew the ticket:
kinit: Ticket expired while renewing credentials

Problem 3: java.io.IOException: Incorrect permission

Description:

An error such as the following example is displayed if the user running one of the Hadoop daemons has a umask of 0002, instead of 0022:

java.io.IOException: Incorrect permission for
/var/folders/B3/B3d2vCm4F+mmWzVPB89W6E+++TI/-Tmp-/tmpYTil84/dfs/data/data1,
expected: rwxr-xr-x, while actual: rwxrwxr-x
       at org.apache.hadoop.util.DiskChecker.checkPermission(DiskChecker.java:107)
       at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:144)
       at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:160)
       at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1484)
       at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1432)
       at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1408)
       at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:418)
       at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:279)
       at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:203)
       at org.apache.hadoop.test.MiniHadoopClusterManager.start(MiniHadoopClusterManager.java:152)
       at org.apache.hadoop.test.MiniHadoopClusterManager.run(MiniHadoopClusterManager.java:129)
       at org.apache.hadoop.test.MiniHadoopClusterManager.main(MiniHadoopClusterManager.java:308)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
       at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
       at org.apache.hadoop.test.AllTestDriver.main(AllTestDriver.java:83)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

Solution:

Make sure that the umask for hdfs and mapred is 0022.

Problem 4: A cluster fails to run jobs after security is enabled.

Description:

A cluster that was previously configured to not use security may fail to run jobs for certain users on certain TaskTrackers (MRv1) or NodeManagers (YARN) after security is enabled:

1. A cluster is at some point in time configured without security enabled. 2. A user X runs some jobs on the cluster, which creates a local user directory on each TaskTracker or NodeManager. 3. Security is enabled on the cluster. 4. User X tries to run jobs on the cluster, and the local user directory on (potentially a subset of) the TaskTrackers or NodeManagers is owned by the wrong user or has overly-permissive permissions.

The bug is that after step 2, the local user directory on the TaskTracker or NodeManager should be cleaned up, but isn't.

If you're encountering this problem, you may see errors in the TaskTracker or NodeManager logs. The following example is for a TaskTracker on MRv1:

10/11/03 01:29:55 INFO mapred.JobClient: Task Id : attempt_201011021321_0004_m_000011_0, Status : FAILED 
Error initializing attempt_201011021321_0004_m_000011_0: 
java.io.IOException: org.apache.hadoop.util.Shell$ExitCodeException: 
at org.apache.hadoop.mapred.LinuxTaskController.runCommand(LinuxTaskController.java:212) 
at org.apache.hadoop.mapred.LinuxTaskController.initializeUser(LinuxTaskController.java:442) 
at org.apache.hadoop.mapreduce.server.tasktracker.Localizer.initializeUserDirs(Localizer.java:272) 
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:963) 
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2209) 
at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2174) 
Caused by: org.apache.hadoop.util.Shell$ExitCodeException: 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:250) 
at org.apache.hadoop.util.Shell.run(Shell.java:177) 
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:370) 
at org.apache.hadoop.mapred.LinuxTaskController.runCommand(LinuxTaskController.java:203) 
... 5 more

Solution:

Delete the mapred.local.dir or yarn.nodemanager.local-dirs directories for that user across the cluster.

Problem 5: The NameNode does not start and KrbException Messages (906) and (31) are displayed.

Description:

When you attempt to start the NameNode, a login failure occurs. This failure prevents the NameNode from starting and the following KrbException messages are displayed:

Caused by: KrbException: Integrity check on decrypted field failed (31) - PREAUTH_FAILED}}

and

Caused by: KrbException: Identifier doesn't match expected value (906)
  Note:

These KrbException error messages are displayed only if you enable debugging output. See Appendix D - Enabling Debugging Output for the Sun Kerberos Classes.

Solution:

Although there are several possible problems that can cause these two KrbException error messages to display, here are some actions you can take to solve the most likely problems:

  • If you are using CentOS/Red Hat Enterprise Linux 5.6 or later, or Ubuntu, which use AES-256 encryption by default for tickets, you must install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File on all cluster and Hadoop user machines. For information about how to verify the type of encryption used in your cluster, see Step 3: If you are Using AES-256 Encryption, install the JCE Policy File. Alternatively, you can change your kdc.conf or krb5.conf to not use AES-256 by removing aes256-cts:normal from the supported_enctypes field of the kdc.conf or krb5.conf file. Note that after changing the kdc.conf file, you'll need to restart both the KDC and the kadmin server for those changes to take affect. You may also need to recreate or change the password of the relevant principals, including potentially the TGT principal (krbtgt/REALM@REALM).
kadmin.local: xst -norandkey -k hdfs.keytab hdfs/fully.qualified.domain.name HTTP/fully.qualified.domain.name
kadmin.local: xst -norandkey -k mapred.keytab mapred/fully.qualified.domain.name HTTP/fully.qualified.domain.name

Problem 6: The NameNode starts but clients cannot connect to it and error message contains enctype code 18.

Description:

The NameNode keytab file does not have an AES256 entry, but client tickets do contain an AES256 entry. The NameNode starts but clients cannot connect to it. The error message doesn't refer to "AES256", but does contain an enctype code "18".

Solution:

Make sure the "Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File" is installed or remove aes256-cts:normal from the supported_enctypes field of the kdc.conf or krb5.conf file. For more information, see the first suggested solution above for Problem 5.

For more information about the Kerberos encryption types, see http://www.iana.org/assignments/kerberos-parameters/kerberos-parameters.xml.

(MRv1 Only) Problem 7: Jobs won't run and TaskTracker is unable to create a local mapred directory.

Description:

The TaskTracker log contains the following error message:

11/08/17 14:44:06 INFO mapred.TaskController: main : user is atm
11/08/17 14:44:06 INFO mapred.TaskController: Failed to create directory /var/log/hadoop/cache/mapred/mapred/local1/taskTracker/atm - No such file or directory
11/08/17 14:44:06 WARN mapred.TaskTracker: Exception while localization java.io.IOException: Job initialization failed (20)
        at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:191)
        at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1199)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1174)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1089)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2257)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2221)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
        at org.apache.hadoop.util.Shell.run(Shell.java:182)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
        at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:184)
        ... 8 more

Solution:

Make sure the value specified for mapred.local.dir is identical in mapred-site.xml and taskcontroller.cfg. If the values are different, the error message above is returned.

(MRv1 Only) Problem 8: Jobs won't run and TaskTracker is unable to create a Hadoop logs directory.

Description:

The TaskTracker log contains the following error message:

11/08/17 14:48:23 INFO mapred.TaskController: Failed to create directory /home/atm/src/cloudera/hadoop/build/hadoop-0.23.2-cdh3u1-SNAPSHOT/logs1/userlogs/job_201108171441_0004 - No such file or directory
11/08/17 14:48:23 WARN mapred.TaskTracker: Exception while localization java.io.IOException: Job initialization failed (255)
        at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:191)
        at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1199)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1174)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1089)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2257)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2221)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
        at org.apache.hadoop.util.Shell.run(Shell.java:182)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
        at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:184)
        ... 8 more

Solution:

In MRv1, the default value specified for hadoop.log.dir in mapred-site.xml is /var/log/hadoop-0.20-mapreduce. The path must be owned and be writable by the mapred user. If you change the default value specified for hadoop.log.dir, make sure the value is identical in mapred-site.xml and taskcontroller.cfg. If the values are different, the error message above is returned.

Problem 9: After you enable cross-realm trust, you can run Hadoop commands in the local realm but not in the remote realm.

Description:

After you enable cross-realm trust, authenticating as a principal in the local realm will allow you to successfully run Hadoop commands, but authenticating as a principal in the remote realm will not allow you to run Hadoop commands. The most common cause of this problem is that the principals in the two realms either don't have the same encryption type, or the cross-realm principals in the two realms don't have the same password. This issue manifests itself because you are able to get Ticket Granting Tickets (TGTs) from both the local and remote realms, but you are unable to get a service ticket to allow the principals in the local and remote realms to communicate with each other.

Solution:

On the local MIT KDC server host, type the following command in the kadmin.local or kadmin shell to add the cross-realm krbtgt principal:

kadmin:  addprinc -e "<enc_type_list>" krbtgt/YOUR-LOCAL-REALM.COMPANY.COM@AD-REALM.COMPANY.COM

where the <enc_type_list> parameter specifies the types of encryption this cross-realm krbtgt principal will support: AES, DES, or RC4 encryption. You can specify multiple encryption types using the parameter in the command above, what's important is that at least one of the encryption types parameters corresponds to the encryption type found in the tickets granted by the KDC in the remote realm. For example:

kadmin:  addprinc -e "aes256-cts:normal rc4-hmac:normal des3-hmac-sha1:normal" krbtgt/YOUR-LOCAL-REALM.COMPANY.COM@AD-REALM.COMPANY.COM

(MRv1 Only) Problem 10: Jobs won't run and can't access files in mapred.local.dir .

Description:

The TaskTracker log contains the following error message:

WARN org.apache.hadoop.mapred.TaskTracker: Exception while localization java.io.IOException: Job initialization failed (1) 

Solution:

  1. Add the mapred user to the mapred and hadoop groups on all hosts.
  2. Restart all TaskTrackers.

Problem 11: Users are unable to obtain credentials when running Hadoop jobs or commands.

Description:

This error occurs because the ticket message is too large for the default UDP protocol. An error message similar to the following may be displayed:

13/01/15 17:44:48 DEBUG ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: 
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. 
(63) - No service creds)]

Solution:

Force Kerberos to use TCP instead of UDP by adding the following parameter to libdefaults in the krb5.conf file on the client(s) where the problem is occurring.

[libdefaults]
udp_preference_limit = 1
 

More Info About the udp_preference_limit Property

When sending a message to the KDC, the library will try using TCP before UDP if the size of the ticket message is larger than the setting specified for the udp_preference_limit property. If the ticket message is smaller than udp_preference_limit setting, then UDP will be tried before TCP. Regardless of the size, both protocols will be tried if the first attempt fails.

Problem 12: Request is a replay exceptions in the logs.

Description:

Symptom: The following exception shows up in the logs for one or more of the Hadoop daemons:

2013-02-28 22:49:03,152 INFO  ipc.Server (Server.java:doRead(571)) - IPC Server listener on 8020: readAndProcess threw exception javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism l
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))]
        at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:159)
        at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1040)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1213)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:566)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:363)
Caused by: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
        at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:741)
        at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:323)
        at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:267)
        at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:137)
        ... 4 more
Caused by: KrbException: Request is a replay (34)
        at sun.security.krb5.KrbApReq.authenticate(KrbApReq.java:300)
        at sun.security.krb5.KrbApReq.<init>(KrbApReq.java:134)
        at sun.security.jgss.krb5.InitSecContextToken.<init>(InitSecContextToken.java:79)
        at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:724)
        ... 7 more

In addition, this problem can manifest itself as performance issues for all clients in the cluster, including dropped connections, timeouts attempting to make RPC calls, and so on.

Likely causes:
  • Multiple services in the cluster are using the same kerberos principal. All secure clients that run on multiple machines should use unique kerberos principals for each machine. For example, rather than connecting as a service principal myservice@EXAMPLE.COM, services should have per-host principals such as myservice/host123.example.com@EXAMPLE.COM.
  • Clocks not in synch: All hosts should run NTP so that clocks are kept in synch between clients and servers.