hive-dev mailing list archives

From "Gopal V (JIRA)" <>
Subject [jira] [Created] (HIVE-4488) BucketizedHiveInputFormat is pessimistic with SMB split generation
Date Fri, 03 May 2013 14:22:24 GMT
Gopal V created HIVE-4488:

             Summary: BucketizedHiveInputFormat is pessimistic with SMB split generation
                 Key: HIVE-4488
             Project: Hive
          Issue Type: Bug
          Components: Query Processor
    Affects Versions: 0.12.0
         Environment: Ubuntu LXC
            Reporter: Gopal V

BucketizedHiveInputFormat generates fewer splits than possible when both tables in a
sort-merge-bucket (SMB) join are partitioned.

When debugging query82 from the TPC-DS spec, there were 7 partitions in the lhs (store_sales)
and 8 partitions in the rhs (inventory), with 1 bucket each.

Only 7 splits are generated for the mappers, instead of a potential 56 (7 x 8).

13/05/01 07:08:22 INFO mapred.FileInputFormat: Total input paths to process : 1
13/05/01 07:08:22 INFO io.BucketizedHiveInputFormat: 7 bucketized splits generated from 344 original splits.

The loop that generates the splits is as follows

        InputSplit[] iss = inputFormat.getSplits(newjob, 0);
        if (iss != null && iss.length > 0) {
          numOrigSplits += iss.length;
          result.add(new BucketizedHiveInputSplit(iss, inputFormatClass

As is clear from the above, all of the more granular (per-file/per-partition) splits coming
off getSplits() are added to a single bucket split.
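
The collapse can be modeled without Hadoop. The sketch below is only an illustration: the class and method names are hypothetical, plain strings stand in for InputSplit objects, and the per-path split counts (six paths with 49 splits, one with 50) are assumed values chosen to add up to the 344 reported in the log. It mirrors the quoted loop body: each non-empty array returned for a path becomes exactly one entry in the result.

```java
import java.util.ArrayList;
import java.util.List;

final class BucketSplitSketch {
    // Hypothetical stand-in for the quoted loop body: every non-empty array
    // of original splits is wrapped into exactly one result entry.
    static List<String[]> wrapPerPath(List<String[]> origSplitsPerPath) {
        List<String[]> result = new ArrayList<>();
        for (String[] iss : origSplitsPerPath) {
            if (iss != null && iss.length > 0) {
                result.add(iss); // one wrapper, however many splits it holds
            }
        }
        return result;
    }

    // Counts the original splits, like numOrigSplits in the quoted snippet.
    static int countOriginal(List<String[]> origSplitsPerPath) {
        int n = 0;
        for (String[] iss : origSplitsPerPath) {
            if (iss != null) {
                n += iss.length;
            }
        }
        return n;
    }

    // Assumed sample data: 7 paths, 6 * 49 + 50 = 344 original splits.
    static List<String[]> sampleInput() {
        List<String[]> input = new ArrayList<>();
        for (int path = 0; path < 7; path++) {
            int count = (path == 6) ? 50 : 49;
            String[] iss = new String[count];
            for (int i = 0; i < count; i++) {
                iss[i] = "path" + path + "-split" + i;
            }
            input.add(iss);
        }
        return input;
    }

    public static void main(String[] args) {
        List<String[]> input = sampleInput();
        System.out.println(wrapPerPath(input).size()
            + " bucketized splits generated from "
            + countOriginal(input) + " original splits.");
    }
}
```

With this sample input, the granularity of the original splits never reaches the result: the number of wrappers depends only on the number of paths.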

Logically, in our mapper we get 

join MergeQueue(

Ideally, we could have used a CombineFileInputFormat to get node locality for the merge
queue inputs (viz. BucketizedHiveInputSplit).

This would be far better at generating splits and at getting more out of short-circuit reads.
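
As a rough illustration of what node locality means for split grouping, the sketch below buckets splits by the host that stores their data, so that each group could be read locally. Everything in it is hypothetical (the Split class, the host names, the single-host-per-split simplification). It does not reproduce CombineFileInputFormat's actual algorithm, which also takes rack locality and split size limits into account.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LocalityGroupingSketch {
    // Hypothetical split descriptor: a file name plus the one host assumed
    // to hold its data.
    static final class Split {
        final String file;
        final String host;

        Split(String file, String host) {
            this.file = file;
            this.host = host;
        }
    }

    // Groups splits by host; each group is a candidate for one
    // locally-read combined split.
    static Map<String, List<Split>> groupByHost(List<Split> splits) {
        Map<String, List<Split>> byHost = new TreeMap<>();
        for (Split s : splits) {
            byHost.computeIfAbsent(s.host, h -> new ArrayList<>()).add(s);
        }
        return byHost;
    }
}
```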
