hadoop-pig-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution
Date Thu, 05 Feb 2009 12:36:01 GMT

     [ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-545:
-----------------------------------------------

    Attachment: WRP.patch

This patch implements the Weighted Range Partitioner as detailed in the DeWitt et al. paper
on Practical Skew Handling in Parallel Joins. The JobControlCompiler has been modified to
use the new partitioner for order by, so the old unit tests should still be valid.
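
To illustrate the general idea of weighted range partitioning, here is a minimal Python sketch.
It is not the code in WRP.patch. The names build_partitioner and get_partition are hypothetical,
and the sketch assumes the sample fits in memory. Quantile boundaries are taken from a sorted
sample; a key that occupies more than one quantile range in the sample is spread over those
partitions in proportion to its share of each, and every other key is routed by binary search.

```python
import bisect
import random
from collections import Counter


def build_partitioner(sample, num_partitions):
    """Derive split points and per-key weights from a sample of keys.

    Returns (boundaries, weights):
      boundaries: num_partitions - 1 split points taken at sample quantiles.
      weights:    for each key that falls into more than one quantile range,
                  a list of (partition, fraction) pairs.
    Assumes a non-empty sample (hypothetical helper, not Pig's API).
    """
    ordered = sorted(sample)
    n = len(ordered)

    # Count, for every key, how many of its sample occurrences land in each range.
    slots = {}
    for i, key in enumerate(ordered):
        part = i * num_partitions // n
        slots.setdefault(key, Counter())[part] += 1

    # The first sample element of each range (except range 0) is a split point.
    boundaries = [ordered[p * n // num_partitions] for p in range(1, num_partitions)]

    # Only keys that straddle several ranges need a probability distribution.
    weights = {}
    for key, counts in slots.items():
        if len(counts) > 1:
            total = sum(counts.values())
            weights[key] = [(p, c / total) for p, c in sorted(counts.items())]
    return boundaries, weights


def get_partition(key, boundaries, weights, rng=random):
    """Pick a partition for key: weighted random for skewed keys, else binary search."""
    if key in weights:
        parts, fracs = zip(*weights[key])
        return rng.choices(parts, weights=fracs)[0]
    return bisect.bisect_right(boundaries, key)


# Example: key 1 dominates the sample, so it is spread over several partitions.
boundaries, weights = build_partitioner([1] * 6 + [2, 3], num_partitions=4)
```

Because the partitions assigned to a skewed key are contiguous and every smaller or larger key
lands at or outside their ends, concatenating the sorted partitions still yields a total order.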

One caveat is that the number of reducers must be specified via the parallel keyword when
doing an order by. Currently, if you don't specify it, there is just one partition by default,
which messes up the distribution. We need to do something about this. Another issue is that
when the Partitioner is configured, it reads the entire sample file from HDFS but currently
does no progress reporting, as I could not think of a way to do it right now.

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Amir Youssefi
>             Fix For: types_branch
>
>         Attachments: WRP.patch
>
>
> In running tests on actual data, I've noticed that the final reduce of an order by has
> skewed partitions.  Some reduces finish in a few seconds while some run for 20 minutes.  Getting
> a better distribution should lead to much better performance for order by.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

