hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dmitriy V. Ryaboy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1592) ORDER BY distribution is uneven when record size is correlated with order key
Date Wed, 01 Sep 2010 20:56:54 GMT

    [ https://issues.apache.org/jira/browse/PIG-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905220#action_12905220

Dmitriy V. Ryaboy commented on PIG-1592:

One proposal is to simply change the default weighted range partitioner to take into account
the record size. If record size is uniform, or uniformly distributed, or non-uniformly distributed
but independent of the order key, this change shouldn't materially affect the distributions
created for data sets not covered by this issue.

> ORDER BY distribution is uneven when record size is correlated with order key
> -----------------------------------------------------------------------------
>                 Key: PIG-1592
>                 URL: https://issues.apache.org/jira/browse/PIG-1592
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Dmitriy V. Ryaboy
>             Fix For: 0.9.0
> The partitioner contributed in PIG-545 distributes the order key space between partitions
so that each partition gets approximately the same number of keys, even when the keys have
a non-uniform distribution over the key space.
> Unfortunately this still allows for severe partition imbalance when record size is correlated
with the order key. By way of motivating example, consider this script which attempts to produce
a list of genuses based on how many species each genus contains:
> {code}
> set default_parallel 60;
> critters = load 'biodata'' as (genus, species);
> genus_counts = foreach (group critters by genus) generate group as genus, COUNT(critters)
as num_species, critters;
> ordered_genuses = order genus_counts by num_species desc;
> store ordered_genuses....
> {code}
> The higher the value of genus_counts, the more species tuples will be contained in the
critters bag, the wider the row. This can cause a severe processing imbalance, as the partitioner
processing the records with the highest values of genus_counts will have the same number of
*records* as the partitioner processing the lowest number, but it will have far more actual
*bytes* to work on.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message