Apache Hadoop Incompatible Changes and Limitations
The following incompatible changes have been introduced in CDH 5:
In CDH 5.10.0 and higher, the output of the fsck command also contains information about decommissioned replicas. Parsers that expect the earlier output format may therefore fail.
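A parser that treats unrecognized lines as non-fatal will survive new summary fields such as the decommissioned-replica counts. A minimal Python sketch; the sample text below is a hypothetical excerpt, and real fsck output varies by release:

```python
def parse_fsck_summary(text: str) -> dict:
    """Collect 'key: value' summary lines from fsck output,
    silently skipping any line that does not match the pattern."""
    summary = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            summary[key.strip()] = value.strip()
    return summary

# Hypothetical excerpt -- real fsck output differs by release.
sample = """Status: HEALTHY
 Total blocks (validated): 120
 Decommissioned replicas: 4
"""
info = parse_fsck_summary(sample)
print(info["Status"])  # prints HEALTHY
```

Because unknown keys simply become extra dictionary entries, the CDH 5.10.0 additions do not break this style of parser.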
In CDH 5.12.0 and higher, JMXJSONServlet replaced JSONP output with Cross-Origin Resource Sharing (CORS) to prevent potential cross-site scripting attacks. External applications that rely on the JSONP output may need to be updated, because the new output is not compatible.
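A client that previously consumed JSONP from the /jmx endpoint can usually switch to parsing the body as plain JSON. The sketch below uses hypothetical payloads (not captured from a live server): it strips a legacy callback wrapper if one is present, and otherwise parses the body directly:

```python
import json
import re

def parse_jmx_body(body: str) -> dict:
    """Parse a /jmx response body, tolerating a legacy
    JSONP wrapper such as 'callback({...});'."""
    match = re.fullmatch(r"\s*\w+\((.*)\);?\s*", body, re.DOTALL)
    if match:
        body = match.group(1)
    return json.loads(body)

plain = '{"beans": []}'                # plain JSON (CDH 5.12.0 and higher)
wrapped = 'callback({"beans": []});'   # hypothetical legacy JSONP form
```

Handling both forms lets the same client work against servers on either side of the change.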
- HDFS-6962: ACL inheritance conflicts with umaskmode
- HDFS-6434: Default permission for creating file should be 644 for WebHdfs/HttpFS.
- HDFS-9085: Show renewer information in DelegationTokenIdentifier#toString.
- The getSnapshottableDirListing() method returns null when there are no snapshottable directories. This is a change from CDH 5 Beta 2, where the method returned an empty array instead.
- Files named .snapshot or .reserved must not exist within HDFS.
- HADOOP-10020: Disable symlinks temporarily.
- HDFS-2832 - The HDFS internal layout version has changed between CDH 5 Beta 1 and CDH 5 Beta 2, so a file system upgrade is required to move an existing Beta 1 cluster to Beta 2.
- HDFS-4451: HDFS balancer command returns exit code 0 on success instead of 1.
- HDFS-4594: WebHDFS open sets the Content-Length header to the value specified by the length parameter, rather than to the amount of data actually returned.
- Impact: In CDH 5, the Content-Length header contains the number of bytes actually returned, rather than the requested length.
- HDFS-4659: Support setting execution bit for regular files.
- Impact: In CDH 5, files copied with copyToLocal may have the executable bit set if the bit was set when they were created in or copied into HDFS.
- HDFS-4997 - libhdfs functions now return correct error codes in errno in case of an error, instead of always returning 255.
- HDFS-5138 - The -finalize NameNode startup option has been removed. To finalize an in-progress upgrade, you should instead use the hdfs dfsadmin -finalizeUpgrade command while your NameNode is running, or while both NameNodes are running in a High Availability setup.
- HDFS-7279 - In CDH 5.5.0 and higher, DataNode WebHDFS implementation uses Netty as an HTTP server instead of Jetty.
- HADOOP-13508 - In CDH 5.11, the behavior of org.apache.hadoop.fs.permission.FsPermission#FsPermission(String mode) changed to fix a bug in parsing sticky bits. The new behavior may cause incompatible changes if an application depends on the original behavior.
- HDFS-11056 - CDH 5.5.6 and CDH 5.9.1 fixed a critical block corruption bug, but the fix introduced a new bug: DataNodes may accumulate too many open file descriptors for deleted meta files over time. This bug is fixed in newer versions, including CDH 5.9.2 and CDH 5.10.0.
Workaround: Restart the DataNodes.
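As an aside on the HADOOP-13508 entry above: the class of bug fixed there is a mode-string parser that silently drops the sticky bit from a four-digit octal mode. This standalone Python sketch (illustrative only, not Hadoop's implementation) shows the distinction a correct parser must preserve:

```python
def parse_octal_mode(mode: str):
    """Parse a 3- or 4-digit octal mode string such as '644' or '1777'.
    Returns (sticky_bit_set, permission_bits)."""
    value = int(mode, 8)
    return bool(value & 0o1000), value & 0o777

# A correct parser keeps the sticky bit for '1777'; a buggy one that
# keeps only the low nine bits would report the same result as '777'.
```

Applications that compared the string form of permissions before and after the CDH 5.11 change may see differences for sticky-bit modes.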
Change in High-Availability Support
In CDH 5, the only high-availability (HA) implementation is Quorum-based storage; shared storage using NFS is no longer supported.
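Quorum-based storage is configured by pointing the shared edits directory at a set of JournalNodes with a qjournal:// URI. A sketch of the relevant hdfs-site.xml entry; the hostnames and the nameservice ID ("mycluster") are placeholders, and 8485 is the default JournalNode RPC port:

```xml
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <!-- Placeholder JournalNode hosts and nameservice ID -->
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
```

Clusters still using an NFS-based shared edits directory must migrate to Quorum-based storage before upgrading.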
To use MRv1 from the CDH 5 tarball, do the following:
- Extract the files from the tarball.
- Create a symbolic link as follows:
ln -s install_dir/bin-mapreduce1 install_dir/share/hadoop/mapreduce1/bin
- Create a second symbolic link as follows:
ln -s install_dir/etc/hadoop-mapreduce1 install_dir/share/hadoop/mapreduce1/conf
- Set the HADOOP_HOME and HADOOP_CONF_DIR environment variables in your execution environment as follows:
$ export HADOOP_HOME=install_dir/share/hadoop/mapreduce1
$ export HADOOP_CONF_DIR=$HADOOP_HOME/conf
- Copy your existing start-dfs.sh and stop-dfs.sh scripts to install_dir/bin-mapreduce1.
- For convenience, add install_dir/bin to the PATH variable in your execution environment.
Apache MapReduce 2.0 (YARN) Incompatible Changes
- The CATALINA_BASE variable no longer determines whether a component is configured for YARN or MRv1. Use the alternatives command instead, and make sure CATALINA_BASE is not set. See the Oozie and Sqoop2 configuration sections for instructions.
- YARN-1288 - YARN Fair Scheduler ACL change: the root queue's ACL defaults to everybody, and other queues' ACLs default to nobody.
- YARN High Availability configurations have changed. Configuration keys have been renamed among other changes.
- The YARN_HOME property has been changed to HADOOP_YARN_HOME.
- Note the following changes to configuration properties in yarn-site.xml:
- The value of yarn.nodemanager.aux-services should be changed from mapreduce.shuffle to mapreduce_shuffle.
- yarn.nodemanager.aux-services.mapreduce.shuffle.class has been renamed to yarn.nodemanager.aux-services.mapreduce_shuffle.class
- yarn.resourcemanager.resourcemanager.connect.max.wait.secs has been renamed to yarn.resourcemanager.connect.max-wait.secs
- yarn.resourcemanager.resourcemanager.connect.retry_interval.secs has been renamed to yarn.resourcemanager.connect.retry-interval.secs
- yarn.resourcemanager.am.max-retries has been renamed to yarn.resourcemanager.am.max-attempts
- The YARN_HOME environment variable used in yarn.application.classpath has been renamed to HADOOP_YARN_HOME. Make sure you include $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/* in the classpath. For more information, see Step 2: Configure YARN daemons in the instructions for deploying CDH with YARN in the Cloudera Installation and Upgrade guide.
- A CDH 4 client cannot be used against a CDH 5 cluster, and vice versa. Note that YARN in CDH 4 is experimental and has the following major incompatibilities:
- Almost all of the proto files have been renamed.
- Several user-facing APIs have been modified as part of an API stabilization effort.
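Taken together, the yarn-site.xml renames listed above look like this after updating. The ShuffleHandler class shown is the stock value for the auxiliary shuffle service; adjust it if you use a custom implementation:

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```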
Apache MapReduce 2.0 (YARN) Limitations
DockerContainerExecutor not supported in YARN
Cloudera does not support DockerContainerExecutor in YARN.
NodeManager configuration for YARN
A Spark on YARN job launches (with a ProcessBuilder shell executable) a Sqoop job with:
yarn jar /data/home/.../infaLib/sqoop-1.4.6-client.jar import \
  -libjars file:///data/home/.../lib/avro-mapred-1.7.5-hadoop2.jar \
  --connect jdbc:oracle:thin:@**********:1521 \
  --username ******* -m 1 --as-avrodatafile \
  --columns CUSTOMER_ID,ORDER_ID,.... --table ORDERS \
  --target-dir hdfs://0.0.0.0:8020/user/..../d9b49b82_0f2e_41d3_bb31_ad6ff28aa966 \
  --password-file ******
This command appears to work fine from the command line, using the default configuration files from /etc/hadoop/conf. However, when you launch it from the Spark application (with a ProcessBuilder shell exec), configuration from the Cloudera parcel directory is picked up instead, and the job fails with the following error:
Log Length: 88
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
- Open the default etc/hadoop/conf/mapred-site.xml.
- Copy the excerpt with the mapreduce.application.classpath property.
- Go to the YARN NodeManager configuration page in Cloudera Manager.
- Paste the excerpt in the NodeManager Advanced Configuration Snippet field. For example:
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
</property>