hive-dev mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-5170) Sorted Bucketed Partitioned Insert hard-codes the reducer count == bucket count
Date Mon, 23 Sep 2013 15:48:04 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774655#comment-13774655 ]

Gopal V commented on HIVE-5170:
-------------------------------

I tried to do this, but unfortunately the FileSinkOperator uses the task-id as the bucket filename.

So if you have 12 reducers, the last reducer will automatically write its output to 00011_0.

This makes it slightly more complex to fix without writing a new SortedFileSinkOperator.
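
To illustrate the problem, here is a minimal sketch of a task-id-derived file name. It is not the FileSinkOperator code: the class and method names are hypothetical, and the zero-padding width and "_0" suffix are simply copied from the 00011_0 example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TaskIdFileNameSketch {
    // Hypothetical helper: derives the output file name purely from the
    // reducer's task id. Padding and suffix are assumptions taken from
    // the "00011_0" example in the message.
    static String bucketFileName(int taskId) {
        return String.format(Locale.ROOT, "%05d_0", taskId);
    }

    // One file name per reducer task, independent of the bucket count.
    static List<String> fileNamesFor(int numReducers) {
        List<String> names = new ArrayList<>();
        for (int taskId = 0; taskId < numReducers; taskId++) {
            names.add(bucketFileName(taskId));
        }
        return names;
    }

    public static void main(String[] args) {
        // 12 reducers produce 12 distinct names, whatever the table's
        // declared bucket count is.
        System.out.println(fileNamesFor(12));
    }
}
```

Under these assumptions, the number of files tracks the reducer count rather than the bucket count, which is why simply raising the reducer count is not enough.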
                
> Sorted Bucketed Partitioned Insert hard-codes the reducer count == bucket count
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-5170
>                 URL: https://issues.apache.org/jira/browse/HIVE-5170
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.12.0
>         Environment: Ubuntu LXC
>            Reporter: Gopal V
>
> When performing a Hive sorted-partitioned insert, the insert optimizer hard-codes the
> number of output files to the actual bucket count of the table.
> https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L4852
> We need at least that many reducers or, if limited, must switch to multi-spray (as implemented
> already), but more reducers are wasteful as long as the HiveKey only contains the partition
> columns.
> At this point, we're still limited to reducers = n-bucket, which is a problem for partitioning
> requests that need to insert nearly a terabyte of data into a single-digit bucket count and
> four-digit partition count.
> Since that is routed by the hashCode of the HiveKey, we can ensure that works by modifying
> the HiveKey to handle n-buckets internally.
> Basically it should only generate hashCode = (sort_cols.hashCode() % n), routing only
> to n reducers overall, no matter how many we spin up.
> So far so good with the hard-coded reducer count.
> But provided we fix the issues brought up by HIVE-5169, the insert becomes friendlier
> to a higher reducer count as well.
> At this juncture, we can modify the hashCode to be slightly more interesting.
> hashCode = (part_cols.hashCode()*31 + (sort_cols.hashCode() % n))
> This generates somewhere between n and partition_count * n unique hash-codes.
> Since the sort-order & bucketing has to be maintained per-partition dir, distributing
> this equally across any number of reducers will let the reducer count scale out.
> This will allow a reducer count high enough for far faster inserts of ORC data into
> a partitioned/sorted table.
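
The two hashCode formulas in the description can be sketched as follows. This is an illustration only, not the HiveKey implementation: the class and method names are hypothetical, column values are modelled as plain Java lists, Math.floorMod is used in place of % so that negative hash codes still map into [0, n), and reducer selection is assumed to follow the style of Hadoop's default HashPartitioner.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BucketRoutingSketch {
    // First formula: only the bucket decides the hash, so at most n
    // distinct values exist no matter how many reducers are started.
    static int bucketOnlyHash(List<?> sortCols, int n) {
        // floorMod keeps the result in [0, n) for negative hash codes.
        return Math.floorMod(sortCols.hashCode(), n);
    }

    // Second formula: the partition columns are mixed in, so each
    // (partition, bucket) pair can get its own hash value.
    static int partitionAwareHash(List<?> partCols, List<?> sortCols, int n) {
        return partCols.hashCode() * 31 + Math.floorMod(sortCols.hashCode(), n);
    }

    // Assumed reducer selection, in the style of Hadoop's HashPartitioner.
    static int reducerFor(int hash, int numReducers) {
        return (hash & Integer.MAX_VALUE) % numReducers;
    }

    // Counts distinct hash values over a small synthetic data set.
    static int[] distinctHashes(int partitions, int keys, int n) {
        Set<Integer> bucketOnly = new HashSet<>();
        Set<Integer> partitionAware = new HashSet<>();
        for (int p = 0; p < partitions; p++) {
            List<String> partCols = Arrays.asList("part-" + p);
            for (int k = 0; k < keys; k++) {
                List<Integer> sortCols = Arrays.asList(k);
                bucketOnly.add(bucketOnlyHash(sortCols, n));
                partitionAware.add(partitionAwareHash(partCols, sortCols, n));
            }
        }
        return new int[] {bucketOnly.size(), partitionAware.size()};
    }

    public static void main(String[] args) {
        int[] counts = distinctHashes(3, 1000, 4);
        System.out.println("bucket-only distinct hashes: " + counts[0]);
        System.out.println("partition-aware distinct hashes: " + counts[1]);
    }
}
```

Rows that share a partition and a bucket always receive the same hash in the second formula, so they still land on one reducer and the per-partition sort order can be preserved, while different partitions are free to spread across more reducers.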

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
