Troubleshooting Cloudera Data Science Workbench
- Check the status of the application (cdsw status).
- Make sure the contents of the configuration file (/etc/cdsw/config/cdsw.conf) are correct.
- SSH to your master host and run the node validation command, cdsw validate, to check that the key services are running.
The following sections describe solutions to potential problems and error messages you may encounter while installing, configuring or using Cloudera Data Science Workbench. There is also an example of the Cloudera Data Science Workbench configuration file for your reference.
Understanding Installation Warnings
This section describes solutions to some warnings you might encounter during the installation process.
Preexisting iptables rules not supported
WARNING: Cloudera Data Science Workbench requires iptables, but does not support preexisting iptables rules.
Kubernetes makes extensive use of iptables. However, it is hard to know how preexisting iptables rules will interact with the rules inserted by Kubernetes. Therefore, Cloudera recommends you run the following command to disable all preexisting rules before you proceed with the installation.
service iptables stop
Please remove the entry corresponding to /dev/xvdc from /etc/fstab
Cloudera Data Science Workbench installs a custom filesystem on its Application and Docker block devices. These filesystems will be used to store user project files and Docker engine images respectively. Therefore, Cloudera Data Science Workbench requires complete access to the block devices. To avoid losing any existing data, make sure the block devices allocated to Cloudera Data Science Workbench are reserved only for the workbench.
Linux sysctl kernel configuration errors
Kubernetes and Docker require non-standard kernel configuration. Depending on the existing state of your kernel, this might result in sysctl errors such as:
sysctl net.bridge.bridge-nf-call-iptables must be set to 1
This is because the settings in /etc/sysctl.conf conflict with the settings required by Cloudera Data Science Workbench. Cloudera cannot make a blanket recommendation on how to resolve such errors because they are specific to your deployment. Cluster administrators may choose to either remove or modify the conflicting value directly in /etc/sysctl.conf, remove the value from the conflicting configuration file, or even delete the module that is causing the conflict.
To start diagnosing the issue, run the following command to see the list of configuration files that are overwriting values in /etc/sysctl.conf.
SYSTEMD_LOG_LEVEL=debug /usr/lib/systemd/systemd-sysctl
You will see output similar to:
Parsing /usr/lib/sysctl.d/00-system.conf
Parsing /usr/lib/sysctl.d/50-default.conf
Parsing /etc/sysctl.d/99-sysctl.conf
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.d/99-sysctl.conf'.
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.d/99-sysctl.conf'.
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.d/99-sysctl.conf'.
Parsing /etc/sysctl.d/k8s.conf
Overwriting earlier assignment of net/bridge/bridge-nf-call-iptables in file '/etc/sysctl.d/k8s.conf'.
Parsing /etc/sysctl.conf
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.conf'.
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.conf'.
Setting 'net/ipv4/conf/all/promote_secondaries' to '1'
Setting 'net/ipv4/conf/default/promote_secondaries' to '1'
Setting 'net/ipv6/conf/default/disable_ipv6' to '0'
Setting 'kernel/sysrq' to '16'
...
/etc/sysctl.d/k8s.conf is the configuration added by Cloudera Data Science Workbench. Administrators must make sure that no other file is overwriting values set by /etc/sysctl.d/k8s.conf.
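To narrow down which file is responsible for a conflicting value, it can help to list every configuration file that assigns a given key. The following is a minimal sketch, not part of Cloudera Data Science Workbench: the function name find_sysctl_assignments is hypothetical, and the directories in the usage comment are the conventional sysctl locations, which may differ on your system.

```shell
# Hypothetical helper: list every line that assigns a sysctl key.
# Usage: find_sysctl_assignments KEY FILE_OR_DIR...
# Example (paths are the conventional locations, adjust as needed):
#   find_sysctl_assignments net.bridge.bridge-nf-call-iptables \
#       /etc/sysctl.conf /etc/sysctl.d /usr/lib/sysctl.d
find_sysctl_assignments() {
    key=$1
    shift
    # Escape dots so the key is matched literally rather than as a regex.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    # Anchoring at the start of the line skips commented-out assignments.
    # -r searches directories recursively, -H prefixes each hit with its file.
    grep -rHE -- "^[[:space:]]*${pattern}[[:space:]]*=" "$@" 2>/dev/null
}
```

Any file other than /etc/sysctl.d/k8s.conf that appears in the output is a candidate for the conflict and should be reviewed by a cluster administrator.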
CDH parcels not found at /opt/cloudera/parcels
- If you are using a custom parcel directory, you can ignore the warning and proceed with the installation. Once Cloudera Data Science Workbench is running, set the path to the CDH parcel in the admin dashboard. See Non-standard CDH Parcel Location.
- This warning can be an indication that you have not added gateway roles to the Cloudera Data Science Workbench nodes. In this case, do not ignore the warning. Exit the installer and go to Cloudera Manager to add gateway roles to the cluster. See Configure Gateway Hosts Using Cloudera Manager.
404 Not Found Error
The 404 Not Found error might appear in the browser when you try to reach the Cloudera Data Science Workbench web application.
This error is an indication that your installation of Cloudera Data Science Workbench was successful, but there was a mismatch in the domain configured in cdsw.conf and the domain referenced in the browser. To fix the error, go to /etc/cdsw/config/cdsw.conf and check that the URL you supplied for the DOMAIN property matches the one you are trying to use to reach the web application. This is the wildcard domain dedicated to Cloudera Data Science Workbench, not the hostname of the master node.
If this requires a change to cdsw.conf, after saving the changes run cdsw reset followed by cdsw init.
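The comparison described above can be scripted. The following sketch assumes that cdsw.conf stores the setting as a shell-style line such as DOMAIN="cdsw.example.com"; the function names cdsw_domain and host_matches_domain are hypothetical helpers, not Cloudera Data Science Workbench commands.

```shell
# Print the DOMAIN value from a cdsw.conf-style file.
# Assumes lines of the form DOMAIN="value"; the last assignment wins.
cdsw_domain() {
    sed -n 's/^[[:space:]]*DOMAIN=//p' "$1" | tail -n 1 | tr -d "\"'"
}

# Succeed if HOST equals DOMAIN or is a subdomain of it.
# Usage: host_matches_domain HOST DOMAIN
host_matches_domain() {
    host=$1
    domain=$2
    # An empty domain can never match.
    [ -n "$domain" ] || return 1
    case "$host" in
        "$domain"|*."$domain") return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, host_matches_domain "cdsw.example.com" "$(cdsw_domain /etc/cdsw/config/cdsw.conf)" returns a non-zero status when the host typed into the browser does not match the configured domain.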
Troubleshooting Kerberos Errors
HDFS commands fail with Kerberos errors even though Kerberos authentication is successful in the web application
If Kerberos authentication is successful in the web application, and the output of klist in the engine reveals a valid-looking TGT, but commands such as hdfs dfs -ls / still fail with a Kerberos error, it is possible that your cluster is missing the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File. The JCE policy file is required when Red Hat uses AES-256 encryption. This library should be installed on each cluster host and will live under $JAVA_HOME. For more information, see Using AES-256 Encryption.
Cannot find renewable Kerberos TGT
16/12/24 16:38:40 WARN security.UserGroupInformation: Exception encountered while running the renewal command. Aborting renew thread. ExitCodeException exitCode=1: kinit: Resource temporarily unavailable while renewing credentials
16/12/24 16:41:23 WARN security.UserGroupInformation: PriviledgedActionException as:user@CLOUDERA.LOCAL (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
This is not a bug. Spark 2 workloads will not be affected by this. Access to Kerberized resources should also work as expected.
Troubleshooting TLS/SSL Errors
- Cloudera Data Science Workbench initialisation fails with an error such as:
Error preparing server: tls: failed to parse private key
- Your browser reports that the Cloudera Data Science Workbench web application is not secure even though you have enabled TLS settings as per Enabling TLS/SSL for Cloudera Data Science Workbench.
Possible Causes and Solutions
- Path to the private key and/or certificate is incorrect - Confirm that the paths to the private key and certificate files are correct by comparing each path and file name to the values for TLS_KEY and TLS_CERT in cdsw.conf or Cloudera Manager.
- Private key file does not have the right permissions - The private key file must have read-only permissions. Set it as follows:
chmod 444 private.key
- Private key is encrypted - Cloudera Data Science Workbench does not support encrypted private keys. Check to see if your private key is encrypted:
$ cat private.key
-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: DES-EDE3-CBC,11556F53E4A2824A
If the private key is encrypted as shown above, use the following steps to decrypt it:
- Make a backup of the private key file.
mv private.key private.key.encrypted
- Decrypt the backup private key and save the file to private.key. You will be asked to enter the private key password.
openssl rsa -in private.key.encrypted -out private.key
- Private key and certificate are not related - Check to see if the private key matches the public key in the certificate.
- Print a hash of the private key modulus.
openssl rsa -in private.key -noout -modulus | openssl md5
(stdin)= 7a8d72ed61bb4be3c1f59e4f0161c023
- Print a hash of the public key modulus.
openssl x509 -in cert.pem -noout -modulus | openssl md5
(stdin)= 7a8d72ed61bb4be3c1f59e4f0161c023
If the two MD5 hashes differ, the private key and the certificate are not related to each other and will not work together. You must revoke the old certificate, generate a new private key and Certificate Signing Request (CSR), and then apply for a new certificate.
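The encryption and key-matching checks above can be combined into a small script. This is a sketch under stated assumptions rather than a Cloudera-provided tool: the function names key_is_encrypted and key_matches_cert are hypothetical, the key is assumed to be an RSA key in PEM format, and the moduli are compared directly instead of through MD5 hashes, which is equivalent for this purpose.

```shell
# Succeed if the PEM file is an encrypted private key. Both the legacy
# header ("Proc-Type: 4,ENCRYPTED") and the PKCS#8 header
# ("BEGIN ENCRYPTED PRIVATE KEY") contain the word ENCRYPTED.
key_is_encrypted() {
    grep -q 'ENCRYPTED' "$1"
}

# Succeed if the RSA private key and the certificate share a modulus.
# Returns 2 if either file cannot be read by openssl.
# Run key_is_encrypted first: openssl prompts for a password on
# encrypted keys.
# Usage: key_matches_cert private.key cert.pem
key_matches_cert() {
    key_mod=$(openssl rsa -in "$1" -noout -modulus 2>/dev/null) || return 2
    cert_mod=$(openssl x509 -in "$2" -noout -modulus 2>/dev/null) || return 2
    [ -n "$key_mod" ] && [ "$key_mod" = "$cert_mod" ]
}
```

A typical sequence is to call key_is_encrypted private.key, decrypt the key if it succeeds, and then call key_matches_cert private.key cert.pem before running the initialisation again.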
Troubleshooting Issues with Running Workloads
This section describes some potential issues data scientists might encounter once the application is running workloads.
404 error in Workbench after starting an engine
This typically happens because a wildcard DNS subdomain was not set up before installation. While the application will largely work, the engine consoles are served on subdomains and will not be routed correctly unless a wildcard DNS entry pointing to the master node is properly configured. You might need to wait 30-60 minutes until the DNS entries propagate. For instructions, see Set Up a Wildcard DNS Subdomain.
Engines cannot be scheduled due to lack of CPU or memory
A symptom of this is the following error message in the Workbench: "Unschedulable: No node in the cluster currently has enough CPU or memory to run the engine."
Either shut down some running sessions or jobs or provision more nodes for Cloudera Data Science Workbench.
Workbench prompt flashes red and does not take input
The Workbench prompt flashing red indicates that the session is not currently ready to take input.
Cloudera Data Science Workbench does not currently support non-REPL interaction. One workaround is to skip the prompt using appropriate command-line arguments. Otherwise, consider using the terminal to answer interactive prompts.
PySpark jobs fail due to HDFS permission errors
: org.apache.hadoop.security.AccessControlException: Permission denied: user=alice, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
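To resolve this, ask a cluster administrator to create an HDFS home directory for each affected user:
hdfs dfs -mkdir /user/<username>
hdfs dfs -chown <username>:<username> /user/<username>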
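When several users need home directories, the two commands can be wrapped in a function. The sketch below is illustrative only: create_hdfs_home and the HDFS_CMD override are hypothetical names introduced here, and the commands are assumed to run as a user with HDFS superuser privileges.

```shell
# Create /user/USER in HDFS and hand ownership to USER.
# HDFS_CMD is a hypothetical override for dry runs; it defaults to "hdfs".
# Setting HDFS_CMD="echo hdfs" prints the commands instead of running them.
# Usage: create_hdfs_home USER
create_hdfs_home() {
    user=$1
    # -p makes mkdir succeed even if the directory already exists.
    ${HDFS_CMD:-hdfs} dfs -mkdir -p "/user/$user" &&
    ${HDFS_CMD:-hdfs} dfs -chown "$user:$user" "/user/$user"
}
```

For example, for u in alice bob; do create_hdfs_home "$u"; done creates both directories. Running the loop once with HDFS_CMD="echo hdfs" is a cheap way to review the commands before they touch the cluster.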
PySpark jobs fail due to Python version mismatch
Exception: Python in worker has different version 2.6 than that in driver 2.7, PySpark cannot run with different minor versions
One solution is to install the matching Python 2.7 version on all the cluster hosts. A better solution is to install the Anaconda parcel on all CDH cluster hosts. Cloudera Data Science Workbench Python engines will then use the version of Python included in the Anaconda parcel, which ensures that the Python versions on the driver and the workers always match. Any library paths in workloads sent from drivers to workers will also match because Anaconda is present in the same location across all hosts. Once the parcel has been installed, set the PYSPARK_PYTHON environment variable in the Cloudera Data Science Workbench Admin dashboard. Alternatively, you can use Cloudera Manager to set the path.