hadoop-pig-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution
Date Fri, 06 Feb 2009 16:05:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671177#action_12671177 ]

Shravan Matthur Narayanamurthy commented on PIG-545:

Thanks for running the patch, Alan. I figured out the cause of the IndexOutOfBounds exception
and fixed it. That should not happen.

I was also working on the L10 issue. I tried the partitioner outside of Pig by sending it
tuples (int, string) with the required ordering set to (desc, asc). It works fine, so I don't
think there is any problem with the partitioner there. Most things like asc, desc and user
comparators should be handled, since I use the comparator passed to me through the jobConf.
So I checked the samples file that was generated. It's not sorted at all. The partitioner's
main assumption, that the samples are sorted, is therefore invalid, and the partitioner will
definitely get messed up.
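
To illustrate why an unsorted samples file matters, here is a minimal sketch in Python (not
the Pig partitioner itself). It assumes a range partitioner that picks a reducer by binary
search over the sampled boundary keys, using natural integer ordering rather than the
comparator from the jobConf. The name assign_partition and the sample values are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

def assign_partition(key, boundaries):
    """Pick a reducer index by binary search; only valid if boundaries is sorted."""
    return bisect_left(boundaries, key)

keys = list(range(100))
sorted_bounds = [25, 50, 75]    # 3 boundaries -> 4 reducers
unsorted_bounds = [75, 25, 50]  # same samples, but not sorted

# Count how many keys each reducer would receive in both cases.
balanced = Counter(assign_partition(k, sorted_bounds) for k in keys)
skewed = Counter(assign_partition(k, unsorted_bounds) for k in keys)

print(sorted(balanced.items()))  # [(0, 26), (1, 25), (2, 25), (3, 24)]
print(sorted(skewed.items()))    # [(0, 26), (2, 25), (3, 49)]
```

With the unsorted boundaries, reducer 1 receives nothing and reducer 3 receives about half
of the keys, which is the kind of skew described in this issue.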

I finally figured out that the way we compile order by in MRCompiler is wrong. When we do
the nested sort using the input POSort, we are converting it into "order by *"; instead, it
should be "order by $0, $1, $2 ...".
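
As a rough illustration only, the Python sketch below shows how a per-column sort with mixed
directions differs from a single whole-tuple sort. It assumes that "order by *" behaves like
one ordering applied to the whole tuple, which is a simplification and not necessarily what
MRCompiler does. The helper make_comparator and the sample rows are hypothetical.

```python
from functools import cmp_to_key

def make_comparator(ascending):
    """Build a tuple comparator from one ascending flag per column."""
    def compare(a, b):
        for x, y, asc in zip(a, b, ascending):
            if x != y:
                result = -1 if x < y else 1
                return result if asc else -result
        return 0
    return compare

rows = [(1, "b"), (2, "a"), (1, "a"), (2, "b")]

# Per-column ordering: $0 descending, $1 ascending.
per_column = sorted(rows, key=cmp_to_key(make_comparator([False, True])))

# Whole-tuple ordering with a single direction (assumed stand-in for "order by *").
whole_asc = sorted(rows)
whole_desc = sorted(rows, reverse=True)

print(per_column)  # [(2, 'a'), (2, 'b'), (1, 'a'), (1, 'b')]
print(whole_asc)   # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
print(whole_desc)  # [(2, 'b'), (2, 'a'), (1, 'b'), (1, 'a')]
```

Neither whole-tuple result matches the (desc, asc) order, so samples sorted that way would
not line up with a partitioner that uses the per-column comparator.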

I have started L10 with the changes. Will update with the results.

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>         Attachments: WRP.patch
> In running tests on actual data, I've noticed that the final reduce of an order by has
> skewed partitions.  Some reduces finish in a few seconds while some run for 20 minutes.  Getting
> a better distribution should lead to much better performance for order by.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
