Apache Hadoop Known Issues
In Hadoop 2.0.0 and later, a number of Hadoop and HDFS properties have been deprecated. (The change dates from Hadoop 0.23.1, on which the Beta releases of CDH 4 were based.) A list of deprecated properties and their replacements can be found at https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-common/DeprecatedProperties.html.
Upgrade Requires an HDFS Upgrade
Upgrading from any release earlier than CDH 5.2.0 to CDH 5.2.0 or later requires an HDFS Upgrade.
See Upgrading Unmanaged CDH Using the Command Line for further information.
Optimizing HDFS Encryption at Rest Requires Newer openssl Library on Some Systems
CDH 5.3 implements the Advanced Encryption Standard New Instructions (AES-NI), which provide substantial performance improvements. To get these improvements, you need a recent version of libcrypto.so on HDFS and MapReduce client hosts, that is, any host from which you originate HDFS or MapReduce requests. Many OS versions have an older version of the library that does not support AES-NI.
See HDFS Data At Rest Encryption in the Encryption section of the Cloudera Security guide for instructions for obtaining the right version.
Other HDFS Encryption Known Issues
Potentially Incorrect Initialization Vector Calculation in HDFS Encryption
A mathematical error in the calculation of the Initialization Vector (IV) for encryption and decryption in HDFS could cause data to appear corrupted when read. The IV is a 16-byte value input to encryption and decryption ciphers. The calculation of the IV implemented in HDFS was found to be subtly different from that used by Java and OpenSSL cryptographic routines. The result is that data could possibly appear to be corrupted when it is read from a file inside an Encryption Zone.
Fortunately, the probability of this occurring is extremely small. For example, the maximum size of a file in HDFS is 64 TB. This enormous file would have a 1-in-4-million chance of hitting this condition. A more typically sized file of 1 GB would have a roughly 1-in-274-billion chance of hitting the condition.
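The two figures quoted above are consistent with a simple model. The model is an assumption made here for illustration, not something the source states: each 16-byte cipher block of a file contributes one chance in 2^64 of triggering the miscalculation. A minimal shell sketch of that arithmetic follows; the function name iv_odds is hypothetical.

```shell
# Assumed model (not from the source): a file of N bytes spans N/16 cipher
# blocks, and the chance of hitting the condition is (N/16) / 2^64.
# Prints X such that the odds are roughly 1 in X.
iv_odds() {
  file_bytes=$1
  blocks=$(( file_bytes / 16 ))
  # 2^64 does not fit in signed 64-bit shell arithmetic, so compute
  # 2^64 / blocks as (2^62 / blocks) * 4.
  echo $(( 4611686018427387904 / blocks * 4 ))
}

iv_odds 70368744177664   # 64 TB file
iv_odds 1073741824       # 1 GB file
```

Under this assumption the two calls print 4194304 (about 4 million) and 274877906944 (about 274 billion).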
Workaround: If you are using the experimental HDFS encryption feature in CDH 5.2, upgrade to CDH 5.3 and verify the integrity of all files inside an Encryption Zone.
DistCp between unencrypted and encrypted locations fails
By default, DistCp compares checksums provided by the filesystem to verify that data was successfully copied to the destination. However, when copying between unencrypted and encrypted locations, the filesystem checksums will not match since the underlying block data is different.
Workaround: Specify the -skipcrccheck and -update distcp flags to avoid verifying checksums.
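As an illustration of the workaround, a copy into or out of an encryption zone might be wrapped as follows. This is only a sketch: the helper name distcp_skip_crc is hypothetical, and the source and destination URIs are whatever paths apply to your cluster.

```shell
# Hypothetical helper: copy between an unencrypted path and an encryption zone.
# -update and -skipcrccheck are the DistCp flags named in the workaround.
# $1 is the source URI, $2 the destination URI.
distcp_skip_crc() {
  hadoop distcp -update -skipcrccheck "$1" "$2"
}
```

Because checksums are skipped, DistCp no longer verifies the copied data itself, so any integrity check has to be done separately.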
Cannot move encrypted files to trash
With HDFS encryption enabled, you cannot move encrypted files or directories to the trash directory.
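Workaround: Delete encrypted files or directories with the -skipTrash flag so that they bypass the trash, for example:
hadoop fs -rm -r -skipTrash /testdir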
If you install CDH using packages, HDFS NFS gateway works out of the box only on RHEL-compatible systems
Because of a bug in native versions of portmap/rpcbind, the HDFS NFS gateway does not work out of the box on SLES, Ubuntu, or Debian systems if you install CDH from the command-line, using packages. It does work on supported versions of RHEL-compatible systems on which rpcbind-0.2.0-10.el6 or later is installed, and it does work if you use Cloudera Manager to install CDH, or if you start the gateway as root.
- On Red Hat and similar systems, make sure rpcbind-0.2.0-10.el6 or later is installed.
- On SLES, Debian, and Ubuntu systems, do one of the following:
- Install CDH using Cloudera Manager; or
- As of CDH 5.1, start the NFS gateway as root; or
- Start the NFS gateway without using packages; or
- You can use the gateway by running rpcbind in insecure mode, using the -i option, but keep in mind that this allows anyone from a remote host to bind to the portmap.
HDFS does not currently provide ACL support for the HDFS gateway
No error when changing permission to 777 on .snapshot directory
Snapshots are read-only; running chmod 777 on the .snapshot directory does not change this, but it also does not produce an error (though other illegal operations do).
Snapshot operations are not supported by ViewFileSystem
Snapshots do not retain directories' quota settings
Permissions for dfs.namenode.name.dir incorrectly set.
Hadoop daemons should set permissions for the dfs.namenode.name.dir (or dfs.name.dir) directories to drwx------ (700), but in fact these permissions are set to the file-system default, usually drwxr-xr-x (755).
Workaround: Use chmod to set permissions to 700. See Configuring Local Storage Directories for Use by HDFS for more information and instructions.
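A minimal sketch of the chmod step follows. The helper name lock_down_name_dir is hypothetical, and the directory argument stands for whatever path dfs.namenode.name.dir points to on your NameNode host.

```shell
# Hypothetical helper: restrict a NameNode metadata directory to its owner
# and print the resulting permissions for confirmation.
# $1 is the directory configured in dfs.namenode.name.dir.
lock_down_name_dir() {
  chmod 700 "$1" && ls -ld "$1"
}
```

The listing printed afterwards should begin with drwx------.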
hadoop fsck -move does not work in a cluster with host-based Kerberos
Workaround: Use hadoop fsck -delete
HttpFS cannot get delegation token without prior authenticated request.
A request to obtain a delegation token cannot initiate an SPNEGO authentication sequence; it must be accompanied by an authentication cookie from a prior SPNEGO authentication sequence.
Workaround: Make another WebHDFS request (such as GETHOMEDIR) to initiate an SPNEGO authentication sequence and then make the delegation token request.
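The two-request sequence can be sketched with curl. This is an illustration under several assumptions: the helper name is hypothetical, a valid Kerberos ticket is already present, curl was built with SPNEGO support, and the home-directory request is spelled with the WebHDFS REST operation name GETHOMEDIRECTORY.

```shell
# Hypothetical helper. $1 is the HttpFS base URL (scheme, host, and port),
# $2 is a writable path for the cookie jar.
httpfs_get_delegation_token() {
  base_url=$1
  cookie_jar=$2
  # Step 1: an ordinary request triggers SPNEGO and stores the auth cookie.
  curl -s --negotiate -u : -c "$cookie_jar" \
    "$base_url/webhdfs/v1/?op=GETHOMEDIRECTORY" > /dev/null
  # Step 2: the token request presents the cookie instead of starting SPNEGO.
  curl -s -b "$cookie_jar" "$base_url/webhdfs/v1/?op=GETDELEGATIONTOKEN"
}
```

The second request succeeds only while the cookie from the first request is still valid.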
DistCp does not work between a secure cluster and an insecure cluster in some cases
Using DistCp with Hftp on a secure cluster using SPNEGO requires that the dfs.https.port property be configured
To run DistCp over Hftp from a secure cluster that uses SPNEGO, you must configure the dfs.https.port property on the client to use the HTTP port (50070 by default).
Workaround: Configure dfs.https.port to use the HTTP port on the client.
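One way to apply the setting for a single copy, rather than in the client configuration files, is to pass it as a generic option on the DistCp command line. The sketch below assumes that approach; the helper name and its arguments are hypothetical.

```shell
# Hypothetical helper: run DistCp with dfs.https.port overridden on the client.
# $1 is the NameNode HTTP port (50070 by default, per the text above),
# $2 is the hftp:// source URI, $3 the destination URI.
distcp_from_secure_hftp() {
  http_port=$1
  hadoop distcp -Ddfs.https.port="$http_port" "$2" "$3"
}
```

The -D option has to appear directly after distcp, before the source and destination arguments.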
Non-HA DFS Clients do not attempt reconnects
This problem means that streams cannot survive a NameNode restart or network interruption that lasts longer than the time it takes to write a block.
DataNodes may become unresponsive to block creation requests
DataNodes may become unresponsive to block creation requests from clients when the directory scanner is running.
Workaround: Disable the directory scanner by setting dfs.datanode.directoryscan.interval to -1.
The active NameNode will not accept an fsimage sent from the standby during rolling upgrade
Rolling upgrade is supported only for clusters managed by Cloudera Manager; you cannot do a rolling upgrade in a command-line-only deployment.
Checkpointing can fail due to an InvalidSignatureException in a secure cluster
Workaround: None required. The problem occurs occasionally because of a race condition; the error is transient, and a subsequent checkpoint may still succeed.
On a DataNode with a large number of blocks, the block report may exceed the maximum RPC buffer size
<property>
  <name>ipc.maximum.data.length</name>
  <value>268435456</value>
</property>
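The value of ipc.maximum.data.length is expressed in bytes. As a quick check of what the setting above amounts to, the following converts it to MiB (the variable name is illustrative only):

```shell
# Convert the ipc.maximum.data.length value shown above from bytes to MiB.
max_rpc_mib=$(( 268435456 / 1024 / 1024 ))
echo "$max_rpc_mib"
```

This prints 256, so the property above allows RPC messages of up to 256 MiB.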
DistCp to S3a fails due to integer overflow in retry timer
Writing to S3 under high load can cause com.amazonaws.AmazonClientException: Unable to complete transfer: timeout value is negative.
Workaround: Reduce the load to S3 by reducing the number of reducers or mappers.
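For a DistCp job, the number of mappers can be capped with the -m option. The sketch below assumes that is the job producing the load; the helper name and the argument values are hypothetical.

```shell
# Hypothetical helper: run DistCp with a cap on simultaneous map tasks.
# $1 is the maximum number of maps, $2 the source URI, $3 the s3a:// destination.
distcp_to_s3a_limited() {
  max_maps=$1
  hadoop distcp -m "$max_maps" "$2" "$3"
}
```

A suitable cap depends on the cluster and the data, so treat any specific number as a starting point for experimentation.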
- FileSystemRMStateStore: Cloudera recommends you use ZKRMStateStore (ZooKeeper-based implementation) to store the ResourceManager's internal state for recovery on restart or failover. Cloudera does not support the use of FileSystemRMStateStore in production.
- ApplicationTimelineServer (also known as Application History Server): Cloudera does not support ApplicationTimelineServer v1. ApplicationTimelineServer v2 is under development and Cloudera does not currently support it.
- Scheduler Reservations: Scheduler reservations are currently at an experimental stage, and Cloudera does not support their use in production.
- Scheduler node-labels: Node-labels are currently experimental with CapacityScheduler. Cloudera does not support their use in production.
Starting an unmanaged ApplicationMaster may fail
Starting a custom Unmanaged ApplicationMaster may fail due to a race in getting the necessary tokens.
Workaround: Try to get the tokens again; the custom unmanaged ApplicationMaster should be able to fetch the necessary tokens and start successfully.
Job movement between queues does not persist across ResourceManager restart
CDH 5 adds the capability to move a submitted application to a different scheduler queue. This queue placement is not persisted across ResourceManager restart or failover, which resumes the application in the original queue.
Workaround: After ResourceManager restart, re-issue previously issued move requests.
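Re-issuing a move might look like the sketch below. It assumes the yarn application -movetoqueue command is available in your Hadoop version; the helper name is hypothetical.

```shell
# Hypothetical helper: move a running application back to the intended queue.
# $1 is the application ID, $2 the target queue name.
reissue_queue_move() {
  yarn application -movetoqueue "$1" -queue "$2"
}
```

Keeping a record of the moves you issue makes it possible to replay them after a restart or failover.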
No JobTracker becomes active if both JobTrackers are migrated to other hosts
If JobTrackers in a High Availability configuration are shut down, migrated to new hosts, and then restarted, no JobTracker becomes active. The logs show a Mismatched address exception.
Workaround: Delete the JobTracker HA state from ZooKeeper with the following command:
$ zkCli.sh rmr /hadoop-ha/<logical name>
Hadoop Pipes may not be usable in an MRv1 Hadoop installation done through tarballs
Under MRv1, MapReduce's C++ interface, Hadoop Pipes, may not be usable with a Hadoop installation done through tarballs unless you build the C++ code on the operating system you are using.
Workaround: Build the C++ code on the operating system you are using. The C++ code is present under src/c++ in the tarball.
Task-completed percentage may be reported as slightly under 100% in the web UI, even when all of a job's tasks have successfully completed.
Oozie workflows will not be recovered in the event of a JobTracker failover on a secure cluster
Delegation tokens created by clients (via JobClient#getDelegationToken()) do not persist when the JobTracker fails over. This limitation means that Oozie workflows will not be recovered successfully in the event of a failover on a secure cluster.
Workaround: Re-submit the workflow.
Encrypted shuffle in MRv2 does not work if used with LinuxContainerExecutor and encrypted web UIs.
In MRv2, if the LinuxContainerExecutor is used (usually as part of Kerberos security), and hadoop.ssl.enabled is set to true (See Configuring Encrypted Shuffle, Encrypted Web UIs, and Encrypted HDFS Transport), then the encrypted shuffle does not work and the submitted job fails.
Workaround: Use encrypted shuffle with Kerberos security without encrypted web UIs, or use encrypted shuffle with encrypted web UIs without Kerberos security.
Link from ResourceManager to Application Master does not work when the Web UI over HTTPS feature is enabled.
In MRv2 (YARN), if hadoop.ssl.enabled is set to true (use HTTPS for web UIs), then the link from the ResourceManager to the running MapReduce Application Master fails with an HTTP Error 500 because of a PKIX exception.
A job can still be run successfully, and, when it finishes, the link to the job history does work.
Workaround: Don't use encrypted web UIs.
Hadoop client JARs don't provide all the classes needed for clean compilation of client code
$ javac -cp '/usr/lib/hadoop/client/*' -d wordcount_classes WordCount.java
org/apache/hadoop/fs/Path.class(org/apache/hadoop/fs:Path.class): warning: Cannot find annotation method 'value()' in type 'org.apache.hadoop.classification.InterfaceAudience.LimitedPrivate': class file for org.apache.hadoop.classification.InterfaceAudience not found
1 warning
The ulimits setting in /etc/security/limits.conf is applied to the wrong user if security is enabled.
Anticipated Resolution: None
Workaround: To increase the ulimits applied to DataNodes, you must change the ulimit settings for the root user, not the hdfs user.
Must set yarn.resourcemanager.scheduler.address to routable host:port when submitting a job from the ResourceManager
When you submit a job from the ResourceManager, yarn.resourcemanager.scheduler.address must be set to a real, routable address, not the wildcard 0.0.0.0.
Workaround: Set the address, in the form host:port, either in the client-side configuration, or on the command line when you submit the job.
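A sketch of the command-line form follows. It assumes the job's main class parses generic options (for example through ToolRunner); otherwise the -D setting is not picked up. The helper name is hypothetical.

```shell
# Hypothetical helper: submit a job with an explicit scheduler address.
# $1 is the scheduler address as host:port, $2 the job JAR, $3 the main class;
# any remaining arguments are passed to the job unchanged.
submit_with_scheduler_address() {
  scheduler_addr=$1; jar=$2; main_class=$3
  shift 3
  hadoop jar "$jar" "$main_class" \
    -Dyarn.resourcemanager.scheduler.address="$scheduler_addr" "$@"
}
```

The -D option is placed before the job's own arguments so that the generic option parser consumes it first.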
Amazon S3 copy may time out
The Amazon S3 filesystem does not support renaming files, and performs a copy operation instead. If the file to be moved is very large, the operation can time out because S3 does not report progress to the TaskTracker during the operation.
Workaround: Use -Dmapred.task.timeout=15000000 to increase the MR task timeout.
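The mapred.task.timeout value is given in milliseconds. To see how long the suggested setting lets a task run without reporting progress, convert it to minutes (the variable name is illustrative only):

```shell
# mapred.task.timeout is in milliseconds; convert the suggested value to minutes.
task_timeout_ms=15000000
echo "$(( task_timeout_ms / 60000 )) minutes"
```

This prints "250 minutes", a little over four hours, so scale the value to the largest file you expect to move.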
Task Controller Changed from DefaultTaskController to LinuxTaskController
<property>
  <name>mapreduce.tasktracker.taskcontroller</name>
  <value>org.apache.hadoop.mapred.DefaultTaskController</value>
</property>
Out-of-memory errors may occur with Oracle JDK 1.8
The total JVM memory footprint for JDK8 can be larger than that of JDK7 in some cases. This may result in out-of-memory errors.
Workaround: Increase max default heap size (-Xmx). In the case of MapReduce, for example, increase Reduce Task Maximum Heap Size in Cloudera Manager (mapred.reduce.child.java.opts, or mapreduce.reduce.java.opts for YARN) to avoid out-of-memory errors during the shuffle phase.
hadoop-test.jar has been renamed to hadoop-test-mr1.jar
As of CDH 5.4.0, hadoop-test.jar has been renamed to hadoop-test-mr1.jar. This JAR file contains the mrbench, TestDFSIO, and nnbench tests.
ResourceManager might not transition to active after an upgrade from CDH 5.3 to CDH 5.4
If the state-store is corrupted, the ResourceManager might not transition to active on an upgrade or restart. YARN-2834 (present in CDH 5.4.0 and higher) prevents such corruption, so this does not affect upgrades from CDH 5.4.0 and higher.
Workaround: Format the state-store to clear out all corrupt applications in the store.
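Formatting the state-store might be done as sketched below. This assumes the yarn resourcemanager -format-state-store option is available in your release and that the ResourceManager is stopped first; the helper name is hypothetical.

```shell
# Hypothetical helper: clear the ResourceManager state-store.
# Assumes the ResourceManager is not running. This discards all stored
# application state, not only the corrupt entries.
format_rm_state_store() {
  yarn resourcemanager -format-state-store
}
```

Applications that were running before the format are not recovered and need to be resubmitted.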
Jobs in pool with DRF policy will not run if root pool is FAIR
If a child pool using DRF policy has a parent pool using Fairshare policy, jobs submitted to the child pool do not run.
Workaround: Change the parent pool to use DRF.