Troubleshooting Installation and Upgrade Problems

The Cloudera Manager Server fails to start after upgrade.

The Cloudera Manager Server fails to start after upgrade.

Possible Reasons

Commands were still active when the upgrade started. This includes commands that a user ran as well as commands that Cloudera Manager triggers automatically, either in response to a state change or on a schedule.

Possible Solutions

For information on known issues, see Known Issues and Workarounds in Cloudera Manager 5.

Navigator HSM KMS Backed by Thales HSM installation fails

The installation of the Navigator HSM KMS backed by Thales HSM fails with the following error message in the role log:
ERROR: Hadoop KMS could not be started

REASON: com.ncipher.provider.nCRuntimeException: com.ncipher.km.nfkm.nfkmCommunicationException The nfkm command program has terminated unexpectedly.

Possible Reasons

The KMS user is not part of the nfast group on the host(s) running the Navigator HSM KMS backed by Thales HSM role.

Possible Solutions

Add the KMS user to the nfast group on the host(s) running the Navigator HSM KMS backed by Thales HSM role:
$ sudo usermod -a -G nfast kms
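
To confirm the membership afterwards, a check along the following lines may help. It is a minimal sketch that relies only on the standard id utility; the function name user_in_group is illustrative and not part of any Cloudera or Thales tooling, while kms and nfast are the names used above.

```shell
# Print "yes" if the given user belongs to the given group, "no" otherwise.
# user_in_group is a hypothetical helper name used only for this sketch.
user_in_group() {
    user="$1"
    group="$2"
    # `id -nG` lists every group name of the user, separated by spaces.
    for g in $(id -nG "$user" 2>/dev/null); do
        if [ "$g" = "$group" ]; then
            echo yes
            return 0
        fi
    done
    echo no
    return 1
}

# Example, run on each host with the KMS role:
#   user_in_group kms nfast
```

Group membership is read when a process starts, so the role most likely needs to be restarted before a newly added group takes effect.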

Failed to start server reported by cloudera-manager-installer.bin

"Failed to start server" reported by cloudera-manager-installer.bin. /var/log/cloudera-scm-server/cloudera-scm-server.log contains a message beginning Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver...

Possible Reasons

You might have SELinux enabled.

Possible Solutions

Switch SELinux to permissive mode for the current boot by running sudo setenforce 0 on the Cloudera Manager Server host. To disable it permanently, edit /etc/selinux/config. For more information, see Setting SELinux mode.
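
As a sketch of the permanent change, the helper below rewrites the SELINUX= line of a config file and keeps a .bak copy next to it. The function name set_selinux_mode is hypothetical, and the sketch assumes the usual key=value layout of /etc/selinux/config, where a change normally takes effect at the next reboot.

```shell
# Rewrite the SELINUX= line of an SELinux config file.
# $1: path to the config file, $2: enforcing | permissive | disabled
# set_selinux_mode is a hypothetical helper name used only for this sketch.
set_selinux_mode() {
    config="$1"
    mode="$2"
    case "$mode" in
        enforcing|permissive|disabled) ;;
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    # Keep a .bak copy, then replace only the SELINUX= line.
    # SELINUXTYPE= is left alone because the pattern requires "=" right after SELINUX.
    sed -i.bak "s/^SELINUX=.*/SELINUX=$mode/" "$config"
}

# Example, as root:
#   getenforce                                        # show the current mode
#   set_selinux_mode /etc/selinux/config disabled     # persist the change
```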

Installation interrupted and installer does not restart

Installation interrupted and installer does not restart.

Possible Reasons

You need to do some manual cleanup.

Possible Solutions

Cloudera Manager Server fails to start with MySQL

Cloudera Manager Server fails to start and the Server is configured to use a MySQL database to store information about service configuration.

Possible Reasons

Tables might be configured with the MyISAM engine. The Server does not start if its tables use the MyISAM engine, and an error such as the following appears in the log file:
Tables ... have unsupported engine type ... .  InnoDB is required.

Possible Solutions

Make sure that the InnoDB engine is configured, not the MyISAM engine. To check what engine your tables are using, run the following command from the MySQL shell: mysql> show table status;

For more information, see Install and Configure MySQL for Cloudera Software.
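
When a schema has many tables, it can be easier to list only the ones that are not InnoDB. The following sketch filters tab-separated "table, engine" rows, which is what the mysql client prints in batch mode. The function name non_innodb_tables, the schema name scm, and the root login in the usage comment are assumptions for illustration.

```shell
# Read "table<TAB>engine" rows on stdin and print every table whose engine
# is not InnoDB. non_innodb_tables is a hypothetical helper name.
non_innodb_tables() {
    awk -F '\t' 'tolower($2) != "innodb" { print $1 " (" $2 ")" }'
}

# Example; the schema name "scm" and the root login are assumptions:
#   mysql -u root -p -N -B -e \
#     "SELECT table_name, engine FROM information_schema.tables
#      WHERE table_schema = 'scm' AND table_type = 'BASE TABLE'" \
#     | non_innodb_tables
```

An empty result suggests that every table in the schema already uses InnoDB.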

Agents fail to connect to Server

Agents fail to connect to Server. You get an Error 113 ('No route to host') in /var/log/cloudera-scm-agent/cloudera-scm-agent.log.

Possible Reasons

You might have SELinux or iptables enabled.

Possible Solutions

Check /var/log/cloudera-scm-server/cloudera-scm-server.log on the Server host and /var/log/cloudera-scm-agent/cloudera-scm-agent.log on the Agent hosts. Disable SELinux and iptables. For more information, see Setting SELinux mode and Disabling the Firewall.

Cluster hosts do not appear

Some cluster hosts do not appear when you click Find Hosts in the install or update wizard.

Possible Reasons

You may have network connectivity problems.

Possible Solutions

  • Make sure all cluster hosts have SSH port 22 open.
  • Check other common causes of loss of connectivity such as firewalls and interference from SELinux. For more information, see Setting SELinux mode and Disabling the Firewall.
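
The SSH check above can be scripted roughly as follows. This is a sketch that assumes bash and the coreutils timeout command are available on the machine running it; the function name port_open and the host names in the usage comment are placeholders.

```shell
# Succeed if a TCP connection to host:port can be opened within 3 seconds.
# port_open is a hypothetical helper name used only for this sketch.
port_open() {
    host="$1"
    port="$2"
    # /dev/tcp is a bash feature, so bash is invoked explicitly.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example with placeholder host names:
#   for h in host1.example.com host2.example.com; do
#       if port_open "$h" 22; then
#           echo "$h: port 22 reachable"
#       else
#           echo "$h: port 22 NOT reachable"
#       fi
#   done
```

A host that fails this check from the Cloudera Manager Server host is a likely candidate for the firewall and SELinux checks described above.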

"Access denied" in install or update wizard

"Access denied" in the install or update wizard during database configuration for Activity Monitor or Reports Manager.

Possible Reasons

Hostname mapping or permissions are not set up correctly.

Possible Solutions

  • For hostname configuration, see Configuring Network Names.
  • For permissions, make sure the values you enter into the wizard match those you used when you configured the databases. The value you enter into the wizard as the database hostname must match the value you entered for the hostname (if any) when you configured the database.

    For example, if you had entered the following when you created the database

    grant all on activity_monitor.* TO 'amon_user'@'myhost1.myco.com' IDENTIFIED BY 'amon_password';

    the value you enter here for the database hostname must be myhost1.myco.com. If you did not specify a host, or used a wildcard to allow access from any host, you can enter either the fully qualified domain name (FQDN), or localhost. For example, if you entered

    grant all on activity_monitor.* TO 'amon_user'@'%' IDENTIFIED BY 'amon_password';

    the value you enter for the database hostname can be either the FQDN or localhost.

Databases fail to start

Activity Monitor, Reports Manager, or Service Monitor databases fail to start.

Possible Reasons

MySQL binlog format problem.

Possible Solutions

Set binlog_format=mixed in /etc/my.cnf. For more information, see this MySQL bug report. See also Cloudera Manager and Managed Service Datastores.
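
Before editing the file, it may be useful to see what it currently sets. The sketch below prints the value of a key from the [mysqld] section of a my.cnf-style file. The function name mysqld_option is hypothetical, and the sketch assumes the setting lives directly in the given file rather than in an included file.

```shell
# Print the value of a key from the [mysqld] section of a my.cnf-style file.
# $1: path to the option file, $2: key written with underscores
# mysqld_option is a hypothetical helper name used only for this sketch.
mysqld_option() {
    awk -v key="$2" '
        # Track whether the current section is [mysqld].
        /^[ \t]*\[/ { in_mysqld = ($0 ~ /^[ \t]*\[mysqld\]/); next }
        in_mysqld {
            n = index($0, "=")
            if (n == 0) next
            k = substr($0, 1, n - 1)
            v = substr($0, n + 1)
            gsub(/[ \t]/, "", k)
            gsub(/[ \t]/, "", v)
            gsub(/-/, "_", k)      # option files accept binlog-format and binlog_format
            if (k == key) { value = v; found = 1 }
        }
        # If the key appears more than once, report the last occurrence.
        END { if (found) print value }
    ' "$1"
}

# Example:
#   mysqld_option /etc/my.cnf binlog_format
```

Empty output means the key is not set in that section, in which case the MySQL default applies.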

Cannot start services after upgrade

You have upgraded the Cloudera Manager Server, but now cannot start services.

Possible Reasons

You may have mismatched versions of the Cloudera Manager Server and Agents.

Possible Solutions

Make sure you have upgraded the Cloudera Manager Agents on all hosts. (The previous version of the Agents will heartbeat with the new version of the Server, but you cannot start HDFS and MapReduce with this combination.)

Cloudera services fail to start

Cloudera services fail to start.

Possible Reasons

Java might not be installed or might be installed at a custom location.

Possible Solutions

See Configuring a Custom Java Home Location for more information on resolving this issue.
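
As a quick first check on an affected host, the sketch below reports which java binary a shell would pick up, preferring JAVA_HOME over the PATH. The function name find_java is hypothetical, and this is not how Cloudera Manager itself locates Java; it only shows whether Java is present in the obvious places.

```shell
# Print the path of a usable java binary, preferring $JAVA_HOME over the PATH.
# find_java is a hypothetical helper name used only for this sketch.
find_java() {
    if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
        echo "$JAVA_HOME/bin/java"
    elif command -v java >/dev/null 2>&1; then
        command -v java
    else
        echo "no java found" >&2
        return 1
    fi
}

# Example:
#   java_bin=$(find_java) && "$java_bin" -version
```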

Activity Monitor displays a status of BAD

The Activity Monitor displays a status of BAD in the Cloudera Manager Admin Console. The log file contains the following message:

ERROR 1436 (HY000): Thread stack overrun: 7808 bytes used of a 131072 byte stack, and 128000 bytes needed. 
Use 'mysqld -O thread_stack=#' to specify a bigger stack. 

Possible Reasons

The MySQL thread stack is too small.

Possible Solutions

  1. Update the thread_stack value in my.cnf to 256KB. The my.cnf file is normally located in /etc or /etc/mysql.
  2. Restart the mysql service: $ sudo service mysql restart
  3. Restart Activity Monitor.

Activity Monitor fails to start

The Activity Monitor fails to start. Logs contain the error read-committed isolation not safe for the statement binlog format.

Possible Reasons

The binlog_format is not set to mixed.

Possible Solutions

Modify the my.cnf file to set binlog_format to mixed, as specified in Install and Configure MySQL for Cloudera Software.

Attempts to reinstall lower version of Cloudera Manager fail

Attempts to reinstall lower versions of CDH or Cloudera Manager using yum fail.

Possible Reasons

It is possible to install, uninstall, and reinstall CDH and Cloudera Manager. In certain cases, this does not complete as expected. If you install Cloudera Manager 5 and CDH 5, then uninstall Cloudera Manager and CDH, and then attempt to install CDH 4 and Cloudera Manager 4, incorrect cached information may result in the installation of an incompatible version of the Oracle JDK.

Possible Solutions

Clear information in the yum cache:

  1. Connect to the CDH host.
  2. Execute either of the following commands:
    $ yum --enablerepo='*' clean all

    or

    $ rm -rf /var/cache/yum/cloudera*
  3. After clearing the cache, proceed with installation.

Create Hive Metastore Database Tables command fails

The Create Hive Metastore Database Tables command fails due to a problem with an escape string.

Possible Reasons

PostgreSQL versions 9.1 and higher require special configuration for Hive because of a backward-incompatible change in the default value of the standard_conforming_strings property. Versions up to PostgreSQL 9.0 defaulted to off, but starting with version 9.1 the default is on.

Possible Solutions

As the administrator user, use the following command to turn standard_conforming_strings off:
ALTER DATABASE <hive_db_name> SET standard_conforming_strings = off; 

HDFS DataNodes fail to start

After upgrading to CDH 5, HDFS DataNodes fail to start with the following exception:

Exception in secureMain
java.lang.RuntimeException: Cannot start datanode because the configured max locked memory size (dfs.datanode.max.locked.memory) of 4294967296 bytes is more than the datanode's available RLIMIT_MEMLOCK ulimit of 65536 bytes.

Possible Reasons

HDFS caching, which is enabled by default in CDH 5, requires new memlock functionality from Cloudera Manager Agents.
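
The comparison behind this exception can be reproduced with a little arithmetic. The sketch below checks whether a memlock limit, given in KB as bash prints it for ulimit -l, covers a configured number of bytes. The function name memlock_sufficient is hypothetical, and the limit of an interactive shell is not necessarily the limit that the Agent applies to the DataNode process.

```shell
# Succeed if a memlock limit covers the configured locked-memory size.
# $1: required bytes (the dfs.datanode.max.locked.memory value)
# $2: limit in KB as printed by bash `ulimit -l`, or "unlimited"
# memlock_sufficient is a hypothetical helper name used only for this sketch.
memlock_sufficient() {
    required_bytes="$1"
    limit_kb="$2"
    [ "$limit_kb" = "unlimited" ] && return 0
    [ $((limit_kb * 1024)) -ge "$required_bytes" ]
}

# Values from the exception above: 4294967296 bytes required, 65536 bytes (64 KB) available.
#   memlock_sufficient 4294967296 64            # fails
#   memlock_sufficient 4294967296 "$(ulimit -l)"
```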

Possible Solutions

Do the following:

  1. Stop all CDH and managed services.
  2. On all hosts with Cloudera Manager Agents, hard-restart the Agents. Before performing this step, ensure you understand the semantics of the hard_restart command by reading Hard Stopping and Restarting Agents.
    • Packages
      RHEL 7, SLES 12, Debian 8, Ubuntu 16.04
      sudo /etc/init.d/cloudera-scm-agent next_stop_hard
      sudo systemctl restart cloudera-scm-agent
      RHEL 5 or 6, SLES 11, Debian 6 or 7, Ubuntu 12.04, 14.04
      sudo service cloudera-scm-agent hard_restart
    • Tarballs
      • To hard-restart the Cloudera Manager Agent, run these commands on each Agent host:
        • RHEL-compatible 7 and higher:
          $ sudo tarball_root/etc/init.d/cloudera-scm-agent next_stop_hard
          $ sudo tarball_root/etc/init.d/cloudera-scm-agent restart
        • All other Linux distributions:
          $ sudo tarball_root/etc/init.d/cloudera-scm-agent hard_restart
      • If you are running single user mode, start Cloudera Manager Agent using the user account you chose. For example, to run the Cloudera Manager Agent as cloudera-scm, you have the following options:
        • Run the following command:
          • RHEL-compatible 7 and higher:
            $ sudo -u cloudera-scm tarball_root/etc/init.d/cloudera-scm-agent next_stop_hard
            $ sudo -u cloudera-scm tarball_root/etc/init.d/cloudera-scm-agent restart
          • All other Linux distributions:
            $ sudo -u cloudera-scm tarball_root/etc/init.d/cloudera-scm-agent hard_restart 
        • Edit the configuration files so the script internally changes the user, and then run the script as root:
          1. Remove the following line from tarball_root/etc/default/cloudera-scm-agent:
            export CMF_SUDO_CMD=" "
          2. Change the user and group in tarball_root/etc/init.d/cloudera-scm-agent to the user you want the Agent to run as. For example, to run as cloudera-scm, change the user and group as follows:
            USER=cloudera-scm
            GROUP=cloudera-scm
          3. Run the Agent script as root:
            • RHEL-compatible 7 and higher:
              $ sudo tarball_root/etc/init.d/cloudera-scm-agent next_stop_hard
              $ sudo tarball_root/etc/init.d/cloudera-scm-agent restart
            • All other Linux distributions:
              $ sudo tarball_root/etc/init.d/cloudera-scm-agent hard_restart
  3. Start all services.

NameNode fails to start after upgrade to CDH 5.2

The following error appears in the NameNode log:
2014-10-16 18:36:29,112 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
java.io.IOException: File system image contains an old layout version -55. An upgrade to version -59 is required.
Please restart NameNode with the "-rollingUpgrade started" option if a rolling upgrade is already started; or restart NameNode with the "-upgrade" option to start a new upgrade.
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:231)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:994)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:726)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:529)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:585)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:751)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:735)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1410)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1476)
2014-10-16 18:36:29,126 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50070
2014-10-16 18:36:29,127 WARN org.apache.hadoop.http.HttpServer2: HttpServer Acceptor: isRunning is false. Rechecking.
2014-10-16 18:36:29,127 WARN org.apache.hadoop.http.HttpServer2: HttpServer Acceptor: isRunning is false
2014-10-16 18:36:29,127 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2014-10-16 18:36:29,128 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2014-10-16 18:36:29,128 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2014-10-16 18:36:29,128 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: File system image contains an old layout version -55. An upgrade to version -59 is required.
Please restart NameNode with the "-rollingUpgrade started" option if a rolling upgrade is already started; or restart NameNode with the "-upgrade" option to start a new upgrade.
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:231)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:994)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:726)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:529)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:585)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:751)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:735)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1410)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1476)
2014-10-16 18:36:29,130 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-10-16 18:36:29,132 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

Possible Reasons

You upgraded CDH to 5.2 using Cloudera Manager and did not run the HDFS Metadata Upgrade command.

Possible Solutions

Stop the HDFS service in Cloudera Manager and follow the steps for upgrade (depending on whether you are using packages or parcels) described in Upgrading CDH and Managed Services Using Cloudera Manager.
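
To confirm from the log that this is the problem, the on-disk and required layout versions can be pulled out of the message. The following is a sketch; the function name layout_versions is hypothetical and the log path in the usage comment is a placeholder, because the location of the NameNode log depends on the deployment.

```shell
# Print "<old> <required>" layout versions found in a NameNode log,
# one line per distinct pair.
# layout_versions is a hypothetical helper name used only for this sketch.
layout_versions() {
    sed -n 's/.*old layout version \(-\{0,1\}[0-9][0-9]*\)\.[ ]*An upgrade to version \(-\{0,1\}[0-9][0-9]*\) is required.*/\1 \2/p' "$1" | sort -u
}

# Example with a placeholder path:
#   layout_versions /path/to/namenode.log
```

For the log excerpt shown earlier in this section, the two values are -55 and -59.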

Oracle invalid identifier

You are using an Oracle database, the Cloudera Navigator Analytics > Audit > Activity tab displays "No data available", and the log contains an Oracle "invalid identifier" error for a query that references dbms_crypto.

Possible Reasons

You have not granted execute permission to sys.dbms_crypto.

Possible Solutions

Run GRANT EXECUTE ON sys.dbms_crypto TO nav;, where nav is the user of the Navigator Audit Server database.