hadoop-pig-dev mailing list archives

From "Dmitriy V. Ryaboy (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-1592) ORDER BY distribution is uneven when record size is correlated with order key
Date Wed, 01 Sep 2010 20:51:53 GMT
ORDER BY distribution is uneven when record size is correlated with order key
-----------------------------------------------------------------------------

                 Key: PIG-1592
                 URL: https://issues.apache.org/jira/browse/PIG-1592
             Project: Pig
          Issue Type: Improvement
            Reporter: Dmitriy V. Ryaboy
             Fix For: 0.9.0


The partitioner contributed in PIG-545 distributes the order key space between partitions
so that each partition gets approximately the same number of keys, even when the keys have
a non-uniform distribution over the key space.
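
As a rough illustration of the idea (a simplified sketch, not the PIG-545 code), range boundaries can be taken at evenly spaced ranks of a sorted key sample, which keeps record counts per partition roughly equal even for skewed keys. The function names, the Pareto-distributed keys, and the sample size below are all assumptions made for the example.

```python
import bisect
import random

def quantile_boundaries(sample_keys, num_partitions):
    """Pick num_partitions - 1 cut points at evenly spaced ranks of the sorted sample."""
    ordered = sorted(sample_keys)
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]

def partition_of(key, boundaries):
    """Index of the range a key falls in (keys equal to a cut point go to the right)."""
    return bisect.bisect_right(boundaries, key)

random.seed(0)
# Hypothetical skewed keys: most values are small, a few are very large.
keys = [int(random.paretovariate(1.5) * 100) for _ in range(100_000)]
sample = random.sample(keys, 5_000)
boundaries = quantile_boundaries(sample, 10)

# Count how many records land in each of the 10 partitions.
counts = [0] * 10
for k in keys:
    counts[partition_of(k, boundaries)] += 1

print("boundaries:", boundaries)
print("records per partition:", counts)
```

This sketch balances record counts only. It ignores how large each record is, and it does not handle a single key that is frequent enough to fill more than one partition.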

Unfortunately, this still allows for severe partition imbalance when record size is correlated
with the order key. As a motivating example, consider this script, which attempts to produce
a list of genera ordered by how many species each genus contains:

{code}
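set default_parallel 60;
critters = load 'biodata' as (genus, species);
-- keep the full critters bag in each row, so row width grows with num_species
genus_counts = foreach (group critters by genus) generate group as genus, COUNT(critters) as num_species, critters;
-- the order key (num_species) is therefore correlated with record size
ordered_genuses = order genus_counts by num_species desc;
store ordered_genuses....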
{code}

The higher the value of num_species, the more species tuples the critters bag contains and
the wider the row. This can cause a severe processing imbalance: the reducer handling the
records with the highest values of num_species receives the same number of *records* as the
reducer handling the lowest values, but it has far more actual *bytes* to work on.
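
The imbalance can be sketched with a small simulation. It is illustrative only: the Pareto-distributed species counts, the 50-byte tuple size, and the variable names are assumptions, not measurements from a real job. Rows are sorted by the order key and split into 60 equal-count ranges, then the bytes in each range are compared.

```python
import random

random.seed(1)
NUM_PARTITIONS = 60        # matches default_parallel in the script
BYTES_PER_SPECIES = 50     # hypothetical average size of one species tuple

# Hypothetical skewed data: most genera have few species, a few have very many.
# Sorted descending, as "order ... by num_species desc" would produce.
num_species = sorted(
    (int(random.paretovariate(1.2)) for _ in range(60_000)), reverse=True
)

# Equal-count range partitioning: every partition gets the same number of rows.
rows_per_partition = len(num_species) // NUM_PARTITIONS
partitions = [
    num_species[i * rows_per_partition:(i + 1) * rows_per_partition]
    for i in range(NUM_PARTITIONS)
]

record_counts = [len(p) for p in partitions]
# Row width is modelled as proportional to the size of the critters bag.
byte_counts = [sum(n * BYTES_PER_SPECIES for n in p) for p in partitions]

print("records per partition:", set(record_counts))
print("bytes in first partition:", byte_counts[0])
print("bytes in last partition: ", byte_counts[-1])
```

Under these assumptions every partition holds the same number of records, yet the partition with the largest num_species values carries many times more bytes than the one with the smallest.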

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

