hadoop-pig-dev mailing list archives

From "Santhosh Srinivasan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution
Date Wed, 26 Nov 2008 01:12:46 GMT

    [ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650824#action_12650824 ]

Santhosh Srinivasan commented on PIG-545:
-----------------------------------------

The current sampler uses random sampling and assumes a uniform distribution of sort keys.
Using a Poisson distribution will let the sampler estimate the expected value of the key
distribution without knowing the actual distribution. This should give a more even
distribution of data across the reducers.
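
To illustrate why the uniform-key assumption matters, here is a rough sketch in Python. It is
not Pig's sampler, and it does not implement the Poisson-based approach proposed above: it
uses plain random sampling and takes range boundaries from the quantiles of the sample. The
function names, the exponentially distributed keys, and the sample size are all assumptions
made for the example.

```python
import bisect
import random


def quantile_boundaries(sample, num_reducers):
    """Pick num_reducers - 1 cut points at evenly spaced ranks of the sorted sample."""
    ordered = sorted(sample)
    n = len(ordered)
    return [ordered[(i * n) // num_reducers] for i in range(1, num_reducers)]


def equal_width_boundaries(lo, hi, num_reducers):
    """Cut points that split [lo, hi] into equal-width ranges (uniform-key assumption)."""
    step = (hi - lo) / num_reducers
    return [lo + i * step for i in range(1, num_reducers)]


def partition_sizes(keys, boundaries):
    """Count how many keys land in each range defined by the sorted boundaries."""
    sizes = [0] * (len(boundaries) + 1)
    for key in keys:
        sizes[bisect.bisect_right(boundaries, key)] += 1
    return sizes


rng = random.Random(0)

# Hypothetical skewed sort keys (assumption: exponential distribution).
keys = [rng.expovariate(1.0) for _ in range(100_000)]
sample = rng.sample(keys, 1_000)
num_reducers = 10

uniform_sizes = partition_sizes(
    keys, equal_width_boundaries(min(keys), max(keys), num_reducers)
)
sampled_sizes = partition_sizes(keys, quantile_boundaries(sample, num_reducers))

print("equal-width ranges:", uniform_sizes)
print("sample quantiles:  ", sampled_sizes)
```

With skewed keys, the equal-width ranges put most records in the first partition, while the
boundaries taken from the sample give each reducer a roughly equal share.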

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>             Fix For: types_branch
>
>
> In running tests on actual data, I've noticed that the final reduce of an order by has
> skewed partitions.  Some reduces finish in a few seconds while some run for 20 minutes.  Getting
> a better distribution should lead to much better performance for order by.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

