Apache Pig Known Issues

Missing results in Hive, Spark, Pig, Custom MapReduce jobs, and other Java applications when filtering Parquet data written by Impala

Apache Hive and Apache Spark rely on Apache Parquet's parquet-mr Java library to perform filtering of Parquet data stored in row groups. Those row groups contain statistics that make the filtering efficient without having to examine every value within the row group.

Recent versions of the parquet-mr library contain a bug described in PARQUET-1217. This bug causes filtering to behave incorrectly if only some of the statistics for a row group are written. Starting in CDH 5.13, Apache Impala populates statistics in this way for Parquet files. As a result, Hive and Spark may incorrectly filter Parquet data that is written by Impala.

In CDH 5.13, Impala started writing Parquet's null_count metadata field without writing the min and max fields. This is valid, but it triggers the PARQUET-1217 bug in the predicate push-down code of the Parquet Java library (parquet-mr). If the null_count field is set to a non-zero value, parquet-mr assumes that min and max are also set and reads them without checking whether they are actually there. If those fields are not set, parquet-mr reads their default value instead.

For integer SQL types, the default value is 0, so parquet-mr incorrectly assumes that the min and max values are both 0. As a result, unless the value 0 itself matches the search condition, every row group is discarded during filtering because of the incorrect min/max values, which leads to missing results.
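The failure mode described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: RowGroupStats, may_contain_buggy, and may_contain_fixed are hypothetical names invented for this example, the model covers a simple equality predicate on one integer column, and it is not the actual parquet-mr code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowGroupStats:
    """Hypothetical, simplified stand-in for one column's row group statistics."""
    null_count: Optional[int] = None
    min_value: Optional[int] = None
    max_value: Optional[int] = None


def may_contain_buggy(stats: RowGroupStats, target: int) -> bool:
    """Models the faulty check: a non-zero null_count is treated as proof that
    min and max are set, and unset fields silently fall back to the default 0."""
    if stats.min_value is None and stats.max_value is None and not stats.null_count:
        return True  # no statistics at all, so the row group must be scanned
    lo = stats.min_value if stats.min_value is not None else 0
    hi = stats.max_value if stats.max_value is not None else 0
    return lo <= target <= hi


def may_contain_fixed(stats: RowGroupStats, target: int) -> bool:
    """Models a corrected check: min and max are trusted only if actually set."""
    if stats.min_value is None or stats.max_value is None:
        return True
    return stats.min_value <= target <= stats.max_value


# Statistics shaped like those Impala writes in CDH 5.13: null_count only.
impala_stats = RowGroupStats(null_count=3)

print(may_contain_buggy(impala_stats, 42))  # False: the row group is wrongly skipped
print(may_contain_buggy(impala_stats, 0))   # True: 0 happens to match the defaults
print(may_contain_fixed(impala_stats, 42))  # True: the row group is scanned
```

In this model, a filter such as "column = 42" skips the row group even though it might contain matching rows, which corresponds to the missing results described above.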

Affected Products: The Parquet Java library (parquet-mr) and, by extension, all Java applications that read Parquet files, including, but not limited to:
  • Hive
  • Spark
  • Pig
  • Custom MapReduce jobs
Affected Versions:
  • CDH 5.13.0, 5.13.1, 5.13.2, and 5.14.0
  • Cloudera Distribution of Apache Spark 2.2 Release 2 and earlier releases on CDH 5.13.0 and later

Who Is Affected: Anyone writing Parquet files with Impala and reading them back with Hive, Spark, or other Java-based components that use the parquet-mr libraries for reading Parquet files.

Severity (Low/Medium/High): High

Impact: Parquet files containing null values for integer fields written by Impala produce missing results in Hive, Spark, and other Java applications when filtering by the integer field.

Immediate Action Required:
  • Upgrade

    You should upgrade to one of the fixed maintenance releases mentioned below.

  • Workaround

    This issue can be avoided at the price of performance by disabling predicate push-down optimizations:
    • In Hive, use the following SET command:

      SET hive.optimize.ppd = false;

    • In Spark, set the following configuration option to false:

      --conf spark.sql.parquet.filterPushdown=false

Addressed in the Following Releases:
  • CDH 5.13.3 and higher
  • CDH 5.14.2 and higher
  • Cloudera Distribution of Apache Spark 2.3 Release 2 and higher

For the latest update on this issue, see the corresponding Knowledge Base article:

TSB:2018-300: Missing results in Hive, Spark, Pig, and other Java applications when filtering Parquet data written by Impala

Hive, Pig, and Sqoop 1 fail in MRv1 tarball installation because /usr/bin/hbase sets HADOOP_MAPRED_HOME to MRv2

This problem affects tarball installations only.

Cloudera Bug: CDH-6640

Resolution: Use workaround.

Workaround: If you are using MRv1, change the following line in /etc/default/hadoop from

  export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

to

  export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce

In addition, /usr/lib/hadoop-mapreduce must not appear in HADOOP_CLASSPATH.
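One way to check the HADOOP_CLASSPATH condition is a short Python script such as the sketch below. The helper name mrv2_entries is hypothetical, and the script assumes that classpath entries are separated by colons and that any entry at or below /usr/lib/hadoop-mapreduce counts as an MRv2 entry.

```python
import os
from typing import List

MRV2_HOME = "/usr/lib/hadoop-mapreduce"


def mrv2_entries(classpath: str) -> List[str]:
    """Return the classpath entries that point at or below the MRv2 directory."""
    hits = []
    for entry in classpath.split(":"):
        path = entry.rstrip("/")
        # Match the directory itself or anything beneath it, but not sibling
        # directories such as /usr/lib/hadoop-0.20-mapreduce.
        if path == MRV2_HOME or path.startswith(MRV2_HOME + "/"):
            hits.append(entry)
    return hits


# Inspect the current environment; an unset variable is treated as empty.
offending = mrv2_entries(os.environ.get("HADOOP_CLASSPATH", ""))
if offending:
    print("MRv2 entries found in HADOOP_CLASSPATH:", ", ".join(offending))
else:
    print("No MRv2 entries found in HADOOP_CLASSPATH.")
```

Run the script in the same shell environment that launches Hive, Pig, or Sqoop 1, so that it sees the same HADOOP_CLASSPATH value.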

Pig fails to read a Parquet file (created with Hive) that has a complex field if the schema is not specified explicitly

Affected Versions: CDH 5.1 - 5.12

Fixed in Versions: CDH 5.13 and higher

Cloudera Bug: CDH-15607

Workaround: Provide the schema of the fields in the LOAD statement.