hadoop-pig-dev mailing list archives

From "Dick King (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-791) reducing number of MR stages with ORDER BY
Date Thu, 30 Apr 2009 22:21:30 GMT

    [ https://issues.apache.org/jira/browse/PIG-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704815#action_12704815 ]

Dick King commented on PIG-791:
-------------------------------

I am considering a modification to Hadoop that would allow users to designate that a map/reduce
output is (a) sorted, (b) likely to be the input to some other map/reduce job in which selected
keys are re-emitted by the second mapper unchanged, with a selection probability that is not
correlated with the ordering, and (c) consumed by a second map/reduce job that uses the same sort order.

It would work by writing a sample file as a secondary output of the mapper in the first map/reduce.
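
To make the idea concrete, here is a minimal sketch of writing a sample as a side output while
the main records pass through untouched. It is not Hadoop code: the function name
write_with_sample, the use of in-memory lists in place of output files, and the choice of
reservoir sampling are all assumptions made for illustration.

```python
import random


def write_with_sample(records, sample_size, seed=0):
    """Return (main_output, sample) from a single pass over records.

    Hypothetical helper: the lists stand in for the main output file and
    the secondary sample file. Reservoir sampling is an assumed technique;
    the proposal does not say how the sample would be drawn.
    """
    rng = random.Random(seed)
    main_output = []
    sample = []
    for seen, record in enumerate(records, start=1):
        # The main data is written out unchanged.
        main_output.append(record)
        if len(sample) < sample_size:
            # Fill the reservoir with the first sample_size records.
            sample.append(record)
        else:
            # Keep the new record with probability sample_size / seen.
            j = rng.randrange(seen)
            if j < sample_size:
                sample[j] = record
    return main_output, sample
```

The point of the single pass is that no separate sampling job has to reread the data; the
sample is ready as soon as the first job finishes.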

This proposal is on my back burner, but could potentially be moved up.

Would that functionality be generally useful here?




> reducing number of MR stages with ORDER BY
> ------------------------------------------
>
>                 Key: PIG-791
>                 URL: https://issues.apache.org/jira/browse/PIG-791
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>
> When an order by is not the only operation in a pig script, it is done in two additional
MR jobs. The first job samples using a sampling loader; the second does the sort. The sample
is used to construct a partitioner that balances the data equally in the sort. The sampler
needs to be changed to be an EvalFunc instead of a loader. This way a split can be put in the
preceding MR job, with the main data being written out and the other part flowing to the
sampler func, which can then write out the sample. The final MR job can then be the sort.

> This change depends on multiquery code.
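
For readers unfamiliar with how a sample can drive a balancing partitioner, the following is a
rough sketch of the general idea, not Pig's actual implementation. The names cut_points and
partition_for are hypothetical, and taking evenly spaced quantiles of the sorted sample is an
assumption about how the boundaries might be chosen.

```python
import bisect


def cut_points(sample_keys, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sample of the data.

    Hypothetical helper: boundaries are evenly spaced quantiles of the
    sorted sample, so each key range should receive a similar share of
    the full data if the sample is representative.
    """
    ordered = sorted(sample_keys)
    if not ordered:
        raise ValueError("sample must not be empty")
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, cuts):
    """Map a key to a partition index using the sorted boundary keys."""
    # Keys below the first boundary go to partition 0, and so on upward.
    return bisect.bisect_right(cuts, key)
```

Because every key in partition i sorts at or below every key in partition i + 1, sorting each
partition independently and concatenating the results in partition order gives a fully sorted
output.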

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

