hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution
Date Sat, 14 Feb 2009 02:13:02 GMT

     [ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-545:
-------------------------------

    Attachment: PIG-545-v3.patch

Attached a revised version of the last patch with the following changes:
1) When parallel is not specified, the code now consults jobClient to get defaultReduces()
and uses 0.9 times that value as the number of reducers (and hence the number of quantiles).
2) There was a bug in the way order by * was handled in MRCompiler, which is now fixed.
3) In WeightedRangePartitioner, the basic idea is to first set up the quantiles array so that
each entry is the last element of a quantile (partition). The code then iterates over all the
sample items; if it finds an item equal to the quantile element of its partition, there is a
good chance this item repeats in the next quantile. The occurrences of such sample items in
each partition are recorded and used when deciding which partition such an item in the real
data should go to: the occurrences in a given partition divided by the total occurrences of
the element give the probability that the element should go to that partition. In the earlier
version of the patch, to set this up, the code compared a sample item with the quantile
element of the next partition instead of the quantile element of the partition in which the
sample item falls. Since the quantile element is the last element of the partition, it is the
one to use in the comparison when deciding whether the item is likely to cross over into the
next partition. This has been fixed.
4) The earlier patch did not handle the case where the number of samples < the number of
quantiles; this is handled now.
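
To illustrate points 3) and 4), here is a minimal sketch of the weighting idea. It is not the code in the patch: the class name WeightedRangeSketch, the use of plain int keys, and the caller-supplied random value u are all assumptions made to keep the example small and deterministic.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the WeightedRangePartitioner from the patch.
class WeightedRangeSketch {
    // quantiles[p] is the last sample value that falls in partition p.
    final int[] quantiles;
    // For each quantile value: probability of sending that key to each partition.
    final Map<Integer, double[]> weights = new HashMap<>();

    // Assumes requestedPartitions >= 1.
    WeightedRangeSketch(int[] samples, int requestedPartitions) {
        int[] sorted = samples.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        // Point 4: never use more quantiles than there are samples.
        int parts = Math.max(1, Math.min(requestedPartitions, n));
        quantiles = new int[parts];
        int[] last = new int[parts]; // index of the last sample in each partition
        Map<Integer, int[]> counts = new HashMap<>();
        for (int p = 0; p < parts; p++) {
            last[p] = (int) ((long) (p + 1) * n / parts) - 1;
            quantiles[p] = (n == 0) ? Integer.MAX_VALUE : sorted[last[p]];
            counts.put(quantiles[p], new int[parts]);
        }
        // Record, per partition, the occurrences of every key that is a quantile element.
        int p = 0;
        for (int i = 0; i < n; i++) {
            while (i > last[p]) p++;
            int[] c = counts.get(sorted[i]);
            if (c != null) c[p]++;
        }
        // Occurrences in a partition over total occurrences gives the probability.
        for (Map.Entry<Integer, int[]> e : counts.entrySet()) {
            int total = Arrays.stream(e.getValue()).sum();
            if (total == 0) continue; // only happens when there are no samples
            double[] w = new double[parts];
            for (int j = 0; j < parts; j++) w[j] = (double) e.getValue()[j] / total;
            weights.put(e.getKey(), w);
        }
    }

    // u is a random number in [0, 1) supplied by the caller.
    int getPartition(int key, double u) {
        double[] w = weights.get(key);
        if (w != null) {
            double cumulative = 0;
            for (int p = 0; p < w.length; p++) {
                cumulative += w[p];
                if (u < cumulative) return p;
            }
            // Floating-point fallback: last partition with a nonzero weight.
            for (int p = w.length - 1; p >= 0; p--) if (w[p] > 0) return p;
        }
        // Other keys go to the first partition whose quantile is >= key.
        int idx = Arrays.binarySearch(quantiles, key);
        if (idx < 0) idx = -idx - 1;
        return Math.min(idx, quantiles.length - 1);
    }
}
```

With samples {1, 2, 5, 5, 5, 5, 5, 5, 8, 9} and 3 partitions, the key 5 is the quantile element of the first two partitions and also spills into the third, so in this sketch it is spread over partitions 0, 1 and 2 with probabilities 1/6, 3/6 and 2/6 rather than all landing on one reducer.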

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: PIG-545-v3.patch, WRP.patch, WRP1.patch
>
>
> In running tests on actual data, I've noticed that the final reduce of an order by has
> skewed partitions.  Some reduces finish in a few seconds while some run for 20 minutes.  Getting
> a better distribution should lead to much better performance for order by.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

