hadoop-hive-dev mailing list archives

From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1602) List Partitioning
Date Fri, 27 Aug 2010 21:53:54 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903664#action_12903664 ]

Ning Zhang commented on HIVE-1602:
----------------------------------

@joydeep, this is intended to be an open-ended discussion about how to tackle partition skew.
Combining small partitions into one large partition seems to be a natural way. Maybe the
name "list partitioning" is not so obvious, but I meant mapping a list of values from the DP
column to one partition rather than a 1-to-1 mapping.
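
To make the list-to-partition mapping concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not Hive code: the partition names, the `LIST_PARTITIONS` table, and both helper functions are hypothetical.

```python
# Hypothetical illustration only: none of these names exist in Hive.
# Each list partition owns a list of DP column values (many-to-one),
# instead of one partition per distinct value (one-to-one).
LIST_PARTITIONS = {
    "event_small": ["s", "m"],  # several small values share one partition
    "event_g": ["g"],           # a large value keeps its own partition
}


def build_value_index(list_partitions):
    """Invert {partition: [values]} into {value: partition}."""
    index = {}
    for name, values in list_partitions.items():
        for value in values:
            if value in index:
                # A value may belong to only one list partition.
                raise ValueError(f"value {value!r} is mapped twice")
            index[value] = name
    return index


def partition_for(value, index, default="event_other"):
    """Return the partition that a row with this DP column value goes to."""
    return index.get(value, default)


index = build_value_index(LIST_PARTITIONS)
print(partition_for("s", index))  # event_small
print(partition_for("m", index))  # event_small
print(partition_for("g", index))  # event_g
```

The catch-all `event_other` default is an assumption of this sketch; whether unlisted values should get a default partition or be rejected is an open design question.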

HAR is one option and we can keep the partition spec as part of the file name so that the
actual column is not stored. 

Another way is to store the partition column value in the data file itself if the partition
corresponds to a list of values. 
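
A rough sketch of what this second option implies, in Python with made-up row data: once a partition holds several values, the path no longer identifies the value, so each row has to carry the column and a reader has to filter on it. The row layout and the `rows_for_value` helper are hypothetical.

```python
# Hypothetical rows of one list partition covering event in {'s', 'm'}.
# The 'event' column is stored in every row because the partition
# directory alone no longer says which value a row belongs to.
partition_rows = [
    {"event": "s", "user": "u1"},
    {"event": "m", "user": "u2"},
    {"event": "s", "user": "u3"},
]


def rows_for_value(rows, column, wanted):
    """Filter the rows of a list partition down to one DP column value."""
    return [row for row in rows if row[column] == wanted]


print(rows_for_value(partition_rows, "event", "s"))
```

Partition pruning can still skip whole list partitions whose value list does not contain the requested value; the per-row filter is needed only inside a matching partition.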

> the user can do a one time analysis of the data (for size distribution on different partitioning
> columns) and then generate the clumping logic manually.

The problem is that there is no way for the user to manually cluster data with different
partition column values. For example, suppose event is a DP column and you find a couple of large
partitions, event = {'l', 'g'}, and three small partitions, event = {'s', 'm', 'l'}. How can the
user manually cluster event=s, event=m, event=l into one? If there are a lot of these small
partitions, they cause a lot of problems in HDFS, the metastore, and the Hive client side.
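
For reference, the grouping step itself is easy to sketch once a size distribution is known; the missing piece is a way to tell Hive to materialize the groups. The Python below is one possible greedy heuristic, not a proposal for the implementation. The function name, the `target_bytes` threshold, and the sample sizes are all assumptions.

```python
def clump_partitions(sizes, target_bytes):
    """Group DP column values so small ones share a partition.

    sizes: {value: size in bytes}, e.g. from a one-time analysis.
    Values at or above target_bytes keep their own partition; smaller
    values are packed greedily, smallest first, into groups whose
    total size stays at or below target_bytes.
    """
    large = [[v] for v, s in sorted(sizes.items()) if s >= target_bytes]
    small = sorted((s, v) for v, s in sizes.items() if s < target_bytes)

    groups, current, current_size = [], [], 0
    for size, value in small:
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(value)
        current_size += size
    if current:
        groups.append(current)
    return large + groups


# Made-up sizes: two large values and three small ones.
sizes = {"a": 1000, "b": 900, "s": 10, "m": 20, "n": 30}
print(clump_partitions(sizes, target_bytes=100))
# [['a'], ['b'], ['s', 'm', 'n']]
```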

> List Partitioning
> -----------------
>
>                 Key: HIVE-1602
>                 URL: https://issues.apache.org/jira/browse/HIVE-1602
>             Project: Hadoop Hive
>          Issue Type: New Feature
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition column values.
> Currently it creates one partition for each distinct DP column value. This could result in
> skew in the created dynamic partitions, in that some partitions are large but there could
> be a large number of small partitions as well. This burdens HDFS as well as the metastore.
> A list partitioning scheme that aggregates a number of small partitions into one big one is
> preferable for skewed partitions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

