Apache Pig Known Issues
Missing results in Hive, Spark, Pig, Custom MapReduce jobs, and other Java applications when filtering Parquet data written by Impala
Apache Hive and Apache Spark rely on Apache Parquet's parquet-mr Java library to perform filtering of Parquet data stored in row groups. Those row groups contain statistics that make the filtering efficient without having to examine every value within the row group.
Recent versions of the parquet-mr library contain a bug described in PARQUET-1217. This bug causes filtering to behave incorrectly if only some of the statistics for a row group are written. Starting in CDH 5.13, Apache Impala populates statistics in this way for Parquet files. As a result, Hive and Spark may incorrectly filter Parquet data that is written by Impala.
In CDH 5.13, Impala started writing Parquet's null_count metadata field without writing the min and max fields. This is valid, but it triggers the PARQUET-1217 bug in the predicate push-down code of the Parquet Java library (parquet-mr). If the null_count field is set to a non-zero value, parquet-mr assumes that min and max are also set and reads them without checking whether they are actually there. If those fields are not set, parquet-mr reads their default value instead.
For integer SQL types, the default value is 0, so parquet-mr incorrectly assumes that the min and max values are both 0. Filtering then goes wrong: unless the value 0 itself satisfies the search condition, every row group with these incomplete statistics is discarded, which leads to missing results.
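The pruning behavior described above can be illustrated with a small simulation. This is only a sketch of the logic, not parquet-mr code: the RowGroupStats class and both functions are hypothetical names invented for illustration, and the filter is assumed to be a simple equality test on an integer column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowGroupStats:
    """Hypothetical, simplified stand-in for a row group's column statistics."""
    null_count: Optional[int] = None
    min_value: Optional[int] = None
    max_value: Optional[int] = None


def may_contain_buggy(stats: RowGroupStats, value: int) -> bool:
    """Mimics the flawed check: trusts min/max whenever null_count is non-zero."""
    if stats.null_count is None or stats.null_count == 0:
        return True  # statistics treated as absent: the row group is scanned
    # Unset min/max silently fall back to the integer default, 0.
    lo = stats.min_value if stats.min_value is not None else 0
    hi = stats.max_value if stats.max_value is not None else 0
    return lo <= value <= hi


def may_contain_fixed(stats: RowGroupStats, value: int) -> bool:
    """Prunes only when min and max were actually written."""
    if stats.min_value is None or stats.max_value is None:
        return True
    return stats.min_value <= value <= stats.max_value


# Statistics as described for Impala: null_count written, min and max omitted.
impala_stats = RowGroupStats(null_count=3)

print(may_contain_buggy(impala_stats, 42))  # False: row group wrongly skipped
print(may_contain_buggy(impala_stats, 0))   # True: 0 happens to match the default
print(may_contain_fixed(impala_stats, 42))  # True: row group is scanned
```

In this sketch, a filter such as "column = 42" skips the row group under the buggy check even though the group may hold matching rows, which is how results go missing.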
Products Affected:
- Hive
- Spark
- Pig
- Custom MapReduce jobs
Releases Affected:
- CDH 5.13.0, 5.13.1, 5.13.2, and 5.14.0
- CDS 2.2 Release 2 Powered by Apache Spark and earlier releases on CDH 5.13.0 and later
Who Is Affected: Anyone writing Parquet files with Impala and reading them back with Hive, Spark, or other Java-based components that use the parquet-mr libraries for reading Parquet files.
Severity (Low/Medium/High): High
Impact: Parquet files containing null values for integer fields written by Impala produce missing results in Hive, Spark, and other Java applications when filtering by the integer field.
You should upgrade to one of the fixed maintenance releases mentioned below.
Workaround: This issue can be avoided, at the cost of performance, by disabling predicate push-down optimizations:
In Hive, use the following SET command:
SET hive.optimize.ppd = false;
In Spark, disable the following configuration setting:
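The setting name is not given on this page. As a sketch, assuming the setting meant is Spark SQL's Parquet filter push-down switch, spark.sql.parquet.filterPushdown, the command line to disable it could be assembled as follows. The helper function and the application name my_job.py are placeholders for illustration.

```python
from typing import List

# Assumption: the setting referred to is Spark SQL's Parquet filter push-down switch.
PARQUET_PUSHDOWN_CONF = "spark.sql.parquet.filterPushdown"


def spark_submit_command(app: str) -> List[str]:
    """Builds a spark-submit command line with Parquet push-down disabled."""
    return ["spark-submit", "--conf", f"{PARQUET_PUSHDOWN_CONF}=false", app]


# my_job.py is a placeholder application name.
print(" ".join(spark_submit_command("my_job.py")))

# Inside an already running PySpark session, the equivalent call would be:
#   spark.conf.set("spark.sql.parquet.filterPushdown", "false")
```

As with the Hive setting, disabling push-down trades query performance for correct results until a fixed release is installed.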
Fixed in Versions:
- CDH 5.13.3 and higher
- CDH 5.14.2 and higher
- CDH 5.15.0 and higher
- CDS 2.3 Release 2 and higher
For the latest update on this issue, see the corresponding Knowledge Base article.
Hive, Pig, and Sqoop 1 fail in MRv1 tarball installation because /usr/bin/hbase sets HADOOP_MAPRED_HOME to MRv2
This problem affects tarball installations only.
Cloudera Bug: CDH-6640
Resolution: Use workaround.
In addition, /usr/lib/hadoop-mapreduce must not exist in HADOOP_CLASSPATH.
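The HADOOP_CLASSPATH condition above can be checked before running a job. The following is a minimal sketch, assuming a colon-separated classpath as on Linux; the function name mrv2_entries is hypothetical.

```python
import os
from typing import List

MRV2_DIR = "/usr/lib/hadoop-mapreduce"


def mrv2_entries(classpath: str) -> List[str]:
    """Returns the classpath entries located at or under the MRv2 directory."""
    hits = []
    for entry in classpath.split(":"):
        # Drop a trailing wildcard or slash, e.g. "/usr/lib/hadoop-mapreduce/*".
        path = entry.rstrip("*").rstrip("/")
        if path == MRV2_DIR or path.startswith(MRV2_DIR + "/"):
            hits.append(entry)
    return hits


offending = mrv2_entries(os.environ.get("HADOOP_CLASSPATH", ""))
if offending:
    print("Remove these entries from HADOOP_CLASSPATH:", offending)
else:
    print("HADOOP_CLASSPATH does not reference", MRV2_DIR)
```

The match is on the exact directory or paths beneath it, so a differently named directory such as /usr/lib/hadoop-0.20-mapreduce is not flagged.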
Pig fails to read a Parquet file (created with Hive) with a complex field if the schema is not specified explicitly
Affected Versions: CDH 5.1 - 5.12
Fixed in Versions: CDH 5.13 and higher
Cloudera Bug: CDH-15607
Workaround: Provide the schema of the fields in the LOAD statement.