hive-dev mailing list archives

From Sergio Peña (JIRA) <j...@apache.org>
Subject [jira] [Created] (HIVE-11763) Use * instead of sum(hash(*)) on Parquet predicate (PPD) integration tests
Date Tue, 08 Sep 2015 20:35:45 GMT
Sergio Peña created HIVE-11763:
----------------------------------

             Summary: Use * instead of sum(hash(*)) on Parquet predicate (PPD) integration tests
                 Key: HIVE-11763
                 URL: https://issues.apache.org/jira/browse/HIVE-11763
             Project: Hive
          Issue Type: Sub-task
            Reporter: Sergio Peña


The integration tests for Parquet predicate push down (PPD) use the following query to validate
the values filtered:
{noformat}
select sum(hash(*)) from ...
{noformat}

It would be better to use {{select * from ...}} instead, so that the filtered values can be checked directly.
It is difficult to tell whether a value was filtered by looking only at the hash.
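
As a minimal sketch of the difference, the two query styles might look like this. The table newtypestbl is the one used by these tests; the column name b and the predicate are assumptions made only for illustration and are not copied from the test files.
{noformat}
-- Current style: the result is a single opaque number, so a wrongly
-- filtered row only shows up as a different sum.
select sum(hash(*)) from newtypestbl where b = true;

-- Proposed style: the surviving rows are listed, so each value can be
-- compared against the predicate by eye.
select * from newtypestbl where b = true;
{noformat}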

Also, we can limit the number of rows produced by the INSERT ... SELECT statement to avoid displaying
many rows when validating the data. I think a LIMIT 2 on each SELECT would be enough.

For example, parquet_ppd_boolean.ppd has this:
{noformat}
insert overwrite table newtypestbl
select * from (
  select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, true from src src1
  union all
  select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, false from src src2
) uniontbl;
{noformat}

If we use LIMIT 2, we reduce the number of rows:
{noformat}
insert overwrite table newtypestbl
select * from (
  select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, true from src src1 LIMIT 2
  union all
  select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, false from src src2 LIMIT 2
) uniontbl;
{noformat}
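
One caveat on LIMIT placement: the Hive language manual describes applying LIMIT to an individual SELECT of a UNION by placing it inside that SELECT's own parenthesized subquery. A possible variant along those lines is sketched below. It has not been run against the test suite, and the subquery aliases s1 and s2 are hypothetical names introduced here.
{noformat}
insert overwrite table newtypestbl
select * from (
  -- each branch is limited inside its own subquery (s1, s2 are assumed aliases)
  select * from (
    select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, true
    from src limit 2
  ) s1
  union all
  select * from (
    select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, false
    from src limit 2
  ) s2
) uniontbl;
{noformat}
With this shape, each branch would contribute at most 2 rows, so the table would hold at most 4 rows.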



