pig-dev mailing list archives

From "Santhosh Srinivasan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution
Date Thu, 22 Jan 2009 18:15:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666234#action_12666234 ]

Santhosh Srinivasan commented on PIG-545:

Two, just getting better sampling won't resolve the issue for order by queries that have one
or a few keys with a very high number of values, such as in a zipf distribution. Unfortunately
for us, zipf is a very common data distribution. In this case our partitioner may need to
be able to detect and split large keys by round robining them to a group of reducers.
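
To make the idea of splitting large keys concrete, here is a minimal sketch in Python (not Pig or Hadoop code). The names find_heavy_keys and RoundRobinPartitioner, the threshold parameter, and the fallback callable are all assumptions made for illustration; nothing here is an existing Pig API.

```python
from collections import Counter
from itertools import cycle


def find_heavy_keys(sample, threshold):
    """Return the keys whose share of the sample exceeds `threshold` (a fraction)."""
    counts = Counter(sample)
    total = len(sample)
    return {key for key, n in counts.items() if n / total > threshold}


class RoundRobinPartitioner:
    """Hypothetical partitioner: heavy keys rotate over a group of reducers."""

    def __init__(self, heavy_groups, fallback):
        # heavy_groups maps a heavy key to the list of reducer ids assigned to it.
        self._cycles = {key: cycle(group) for key, group in heavy_groups.items()}
        # fallback is any callable mapping an ordinary key to a reducer id.
        self._fallback = fallback

    def get_partition(self, key):
        rotation = self._cycles.get(key)
        if rotation is not None:
            # Heavy key: hand successive records to successive reducers in its group.
            return next(rotation)
        return self._fallback(key)


sample = ["a"] * 8 + ["b", "c"]
heavy = find_heavy_keys(sample, threshold=0.5)
partitioner = RoundRobinPartitioner({"a": [0, 1, 2]}, fallback=lambda key: 3)
assignments = [partitioner.get_partition("a") for _ in range(4)]
```

For an order by, the reducers in one group would presumably need to be adjacent in the output order, so that concatenating the reducer outputs still yields sorted data.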

Better sampling will not resolve the issue for order by. It will help in having more evenly
sized partitions for the reducers. Since it is sampling and not a brute-force approach of checking
the cardinality of each key, there will always be a non-zero probability of one reducer
getting more data than the other reducers. The better sampling approach will minimize such

Secondly, post sampling, we can ensure that reducers get the right partitions by using Hadoop's
ability to pick reducers based on partition functions. I am not quite sure how Pig can propose
a generic partition function to achieve this.
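
As a rough illustration of a sample-driven partition function, the sketch below (plain Python, not Hadoop code) picks cut points from a sorted sample and maps each key to a reducer by binary search. The function names quantile_boundaries and range_partition are assumptions for this example only, and this is not necessarily how Pig's sampler works.

```python
import bisect


def quantile_boundaries(sample, num_reducers):
    """Pick num_reducers - 1 cut points at evenly spaced ranks of the sorted sample."""
    ordered = sorted(sample)
    n = len(ordered)
    return [ordered[(i * n) // num_reducers] for i in range(1, num_reducers)]


def range_partition(key, boundaries):
    """Return the reducer index for `key`: the number of boundaries <= key."""
    return bisect.bisect_right(boundaries, key)


# Evenly spread keys give distinct cut points.
uniform_cuts = quantile_boundaries(list(range(100)), 4)

# One dominant key (90 of 100 sampled values) makes every cut point the same.
skewed_cuts = quantile_boundaries([1] * 90 + list(range(2, 12)), 4)
```

When the sample is dominated by one key, the cut points collapse onto that key, so every record carrying it lands on a single reducer. That is the case where a plain range partition function alone does not balance the load.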

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Amir Youssefi
>             Fix For: types_branch
> In running tests on actual data, I've noticed that the final reduce of an order by has
> skewed partitions.  Some reduces finish in a few seconds while some run for 20 minutes.  Getting
> a better distribution should lead to much better performance for order by.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
