hive-dev mailing list archives

From "Gopal V (JIRA)" <>
Subject [jira] [Created] (HIVE-4926) Queries which specify clustered-by keys as constants will still scan all buckets
Date Wed, 24 Jul 2013 19:01:48 GMT
Gopal V created HIVE-4926:

             Summary: Queries which specify clustered-by keys as constants will still scan all buckets
                 Key: HIVE-4926
             Project: Hive
          Issue Type: Improvement
    Affects Versions: 0.12.0
            Reporter: Gopal V

When a table is CLUSTERED BY (key) into multiple buckets, a query that specifies a constant
for that key in its predicate still scans every bucket file in the table directory.

Ideally, only one bucket needs to be inspected for a given key, particularly if
hive.enforce.bucketing is turned on.

When a simple filter query like the following is run

select * from store_sales where ss_item_sk = 1;

the log files contain:

{code}
Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000005_0
Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000006_0
Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000007_0
Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000008_0
Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000009_0
{code}

This reads 32x as much data as necessary; the right approach is to scan only
the bucket which matches the predicate.
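
The pruning described above can be sketched roughly as follows. This is an illustration
only, not Hive code: the function names are hypothetical, the table is assumed to have 32
buckets (inferred from the "32x" figure), ss_item_sk is assumed to be an INT column whose
hash is the value itself, and bucket files are assumed to be named with a zero-padded
bucket id followed by "_0", as in the log lines above.

```python
# Illustrative sketch only; names and assumptions are not taken from Hive's source.

INT_MAX = 0x7FFFFFFF  # same value as Java's Integer.MAX_VALUE


def bucket_for_int_key(key: int, num_buckets: int) -> int:
    """Return the bucket id for an INT key.

    Assumption: the hash of an INT value is the value itself, and the
    bucket id is (hash & Integer.MAX_VALUE) % num_buckets.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return (key & INT_MAX) % num_buckets


def bucket_file_name(bucket_id: int) -> str:
    """Assumed naming scheme: six-digit bucket id plus '_0' (e.g. 000005_0)."""
    return f"{bucket_id:06d}_0"


def prune_bucket_files(files, key, num_buckets):
    """Keep only the file(s) that can contain rows for the given key."""
    wanted = bucket_file_name(bucket_for_int_key(key, num_buckets))
    return [f for f in files if f.rsplit("/", 1)[-1] == wanted]


if __name__ == "__main__":
    base = "hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/"
    all_files = [base + bucket_file_name(i) for i in range(32)]  # 32 buckets assumed
    # For the predicate ss_item_sk = 1, a single file survives pruning.
    print(prune_bucket_files(all_files, key=1, num_buckets=32))
```

Under these assumptions, the predicate ss_item_sk = 1 would need only one of the 32 files
rather than all of them.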
