hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-802) PERFORMANCE: not creating bags for ORDER BY
Date Thu, 07 May 2009 19:42:46 GMT

    [ https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707064#action_12707064 ]

Pradeep Kamath commented on PIG-802:

PIG-744 is a duplicate of this issue; I will mark that one as a duplicate.

Pasting the summary from PIG-744, which has a little more detail:
Currently, ORDER BY results in multiple map-reduce jobs (2 or 3 depending on the script), of
which the last one does the actual ordering. In this last map-reduce job, we create a bag
of values (each value being the entire tuple that is getting sorted) for each sort key(s)
using POPackage in the reduce phase. We then immediately flatten the bag in the foreach
following the package, so there is really no need for the bag. To stay generic and keep using
the existing operators, we can be more efficient by tagging the POPackage to create bags that
are backed by the Hadoop iterator itself. This way we do not build the bag by copying
each tuple from the Hadoop iterator. This should help both performance and scalability
by making better use of memory.
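
To make the difference concrete, here is a rough Java sketch of the two approaches. It is not Pig code: StreamingBagSketch, materialize, IteratorBackedBag and flatten are hypothetical names made up for illustration, and a plain java.util.Iterator stands in for the value iterator that Hadoop hands to the reducer. The real change would live in POPackage and the bag implementations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names are Pig or Hadoop classes.
final class StreamingBagSketch {

    // Copying approach: every value from the reducer's iterator is stored
    // in memory before anything downstream can read it.
    static <T> List<T> materialize(Iterator<T> values) {
        List<T> bag = new ArrayList<>();
        while (values.hasNext()) {
            bag.add(values.next());
        }
        return bag;
    }

    // Streaming approach: the "bag" is a thin view over the iterator.
    // It holds no tuples itself and can be traversed only once.
    static final class IteratorBackedBag<T> implements Iterable<T> {
        private final Iterator<T> source;
        private boolean consumed = false;

        IteratorBackedBag(Iterator<T> source) {
            this.source = source;
        }

        @Override
        public Iterator<T> iterator() {
            if (consumed) {
                throw new IllegalStateException("bag can be read only once");
            }
            consumed = true;
            return source;
        }
    }

    // Stand-in for the flatten in the foreach: emits each value in the bag.
    static <T> void flatten(Iterable<T> bag, Consumer<T> emit) {
        for (T value : bag) {
            emit.accept(value);
        }
    }

    public static void main(String[] args) {
        List<String> reducerValues = Arrays.asList("t1", "t2", "t3");

        // Both paths emit the same tuples in the same order; only the
        // first one holds a second copy of them in memory.
        flatten(materialize(reducerValues.iterator()), System.out::println);
        flatten(new IteratorBackedBag<>(reducerValues.iterator()), System.out::println);
    }
}
```

The trade-off in this sketch is that an iterator-backed bag is single-pass, which is acceptable here only because the bag is flattened right away and never read again.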

> PERFORMANCE: not creating bags for ORDER BY
> -------------------------------------------
>                 Key: PIG-802
>                 URL: https://issues.apache.org/jira/browse/PIG-802
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
> Order by should be changed to not use POPackage to put all of the tuples in a bag on
> the reduce side, as the bag is just immediately flattened. It can instead work like join does
> for the last input in the join.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
