hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution
Date Thu, 05 Feb 2009 23:03:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670938#action_12670938 ]

Alan Gates commented on PIG-545:

I ran the PigMix queries L9 (order by on a single field) and L10 (order by on multiple
fields).  L9 went from 14 minutes to 8, so this patch holds huge promise.  But L10 went from
8 minutes to 11, so it doesn't seem to work well in the multiple-field case.  (It could also
be related to the fact that L10 uses descending order on one of the columns; I don't know
whether the new partitioner can handle that.)  I also ran our end-to-end order by tests on
it, and all passed except bigdata_1, which fails with an IndexOutOfBounds exception in the
new WeightedRangePartitioner.

As for the caveat that it needs to know the number of reducers up front, I believe that in
cases where the user doesn't specify parallel, we can determine the parallelism of the
reduces using JobClient.getDefaultReduces().  We need to double-check that this gives us the
right information in both the HOD and non-HOD cases.
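
To illustrate why a range partitioner needs the reducer count before it can run, here is a
minimal sketch. It is not the code in WRP.patch; the class and method names (RangeBoundaries,
cutPoints, partition) are hypothetical, and it uses plain int keys rather than Pig tuples. It
picks numReducers - 1 cut points at evenly spaced quantiles of a sorted sample and routes
each key to the range it falls in.

```java
/** Hypothetical sketch of range partitioning from a sorted sample; not the WRP.patch code. */
public class RangeBoundaries {

    /**
     * Picks numReducers - 1 cut points at evenly spaced quantiles of the sorted sample.
     * The reducer count must be known here, before any key is partitioned.
     */
    public static int[] cutPoints(int[] sortedSample, int numReducers) {
        if (numReducers < 1 || sortedSample.length == 0) {
            throw new IllegalArgumentException("need a non-empty sample and at least one reducer");
        }
        int[] cuts = new int[numReducers - 1];
        for (int i = 1; i < numReducers; i++) {
            // i < numReducers, so idx is always a valid index into the sample.
            int idx = (int) ((long) i * sortedSample.length / numReducers);
            cuts[i - 1] = sortedSample[idx];
        }
        return cuts;
    }

    /**
     * Returns the reducer index for a key: the number of cut points strictly less than
     * the key. The result is always in the range [0, cuts.length].
     */
    public static int partition(int key, int[] cuts) {
        int lo = 0;
        int hi = cuts.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cuts[mid] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

With the sample {1, 5, 5, 5, 5, 5, 5, 9} and four reducers, all three cut points come out
as 5, so every record with key 5 goes to reducer 0 while reducers 1 and 2 receive nothing.
Unweighted quantile cuts cannot split a frequent key across reducers, which is one way the
skew described in this issue can arise.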

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Amir Youssefi
>             Fix For: types_branch
>         Attachments: WRP.patch
> In running tests on actual data, I've noticed that the final reduce of an order by has
> skewed partitions.  Some reduces finish in a few seconds while some run for 20 minutes.  Getting
> a better distribution should lead to much better performance for order by.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
