hive-dev mailing list archives

From "Yibing Shi (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-16291) Hive fails when unions a parquet table with itself
Date Fri, 24 Mar 2017 13:57:41 GMT
Yibing Shi created HIVE-16291:
---------------------------------

             Summary: Hive fails when unions a parquet table with itself
                 Key: HIVE-16291
                 URL: https://issues.apache.org/jira/browse/HIVE-16291
             Project: Hive
          Issue Type: Bug
          Components: Hive
            Reporter: Yibing Shi
         Attachments: HIVE-16291.1.patch

Reproduce commands:

{code:sql}
create table tst_unin (col1 int) partitioned by (p_tdate int) stored as parquet;
insert into tst_unin partition (p_tdate=201603) values (20160312), (20160310);
insert into tst_unin partition (p_tdate=201604) values (20160412), (20160410);
select count(*) from (select tst_unin.p_tdate from tst_unin union all select tst_unin.p_tdate
from tst_unin where tst_unin.col1=20160302) t1;
{code}

The table is stored in Parquet, a columnar file format. Hive tries to push the column
projection down to the table scan operators so that only the needed columns are read.
It does this by adding the needed column IDs to the job configuration under the property "hive.io.file.readcolumn.ids".

In the case above, the query unions the results of two subqueries that select from the same
table. The first subquery doesn't need any column from the Parquet file, while the second
needs the column "col1". Hive has a bug here: it ends up setting "hive.io.file.readcolumn.ids" to
a value like "0,,0", which the method ColumnProjectionUtils.getReadColumnIDs cannot parse.
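
To illustrate why a value like "0,,0" is a problem, here is a minimal, self-contained sketch. It is not
Hive's actual code and not the attached patch: the class ReadColumnIdsSketch and the methods parseStrict
and parseLenient are hypothetical names, and the sketch only assumes that the property holds a
comma-separated list of integer column IDs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parser of a comma-separated column ID list.
// This does not reproduce ColumnProjectionUtils; it only shows the failure mode.
final class ReadColumnIdsSketch {

    // Strict parsing: every token between commas must be an integer.
    static List<Integer> parseStrict(String value) {
        List<Integer> ids = new ArrayList<>();
        for (String token : value.split(",")) {
            // "0,,0" yields the tokens "0", "" and "0"; the empty token
            // makes Integer.parseInt throw NumberFormatException.
            ids.add(Integer.parseInt(token));
        }
        return ids;
    }

    // Lenient parsing (an assumption, for comparison only): skip empty tokens.
    static List<Integer> parseLenient(String value) {
        List<Integer> ids = new ArrayList<>();
        for (String token : value.split(",")) {
            if (!token.isEmpty()) {
                ids.add(Integer.parseInt(token));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println("strict  \"0\"    -> " + parseStrict("0"));
        try {
            parseStrict("0,,0");
        } catch (NumberFormatException e) {
            System.out.println("strict  \"0,,0\" -> NumberFormatException");
        }
        System.out.println("lenient \"0,,0\" -> " + parseLenient("0,,0"));
    }
}
```

The empty token in the middle of "0,,0" is what a strict integer parser chokes on. Whether the right fix
is to tolerate empty tokens when reading or to avoid writing them in the first place is a design choice
this sketch does not settle.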




