hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (PIG-791) reducing number of MR stages with ORDER BY
Date Wed, 27 May 2009 01:01:46 GMT

     [ https://issues.apache.org/jira/browse/PIG-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-791.

    Resolution: Won't Fix

After some testing by Amir Youssefi, we determined that making this change actually makes
performance worse.  Changing RandomSampleLoader into an EvalFunc means that every record in
the file has to be read and parsed.  Since Hadoop efficiently supports skipping in the input
stream, a sampling loader can avoid most of that work, so reading and parsing everything is
very expensive by comparison.  Instead we will pursue making RandomSampleLoader subsume the
user's loader to avoid requiring a third MR job (see PIG-820).
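
To illustrate the cost difference described above, here is a minimal Python sketch. It is not
Pig's RandomSampleLoader code; the function names, the in-memory tab-separated "file", and the
parse counter are all assumptions made for illustration. It compares sampling by seeking to
random offsets against sampling after reading and parsing every record.

```python
import io
import random

def parse(line, counter):
    """Stand-in for record parsing; counts how many records get parsed."""
    counter[0] += 1
    return line.rstrip(b"\n").split(b"\t")

def sample_by_seeking(stream, size, n, rng, counter):
    """Loader-style sampling (hypothetical): jump to random offsets, parse only n records."""
    out = []
    for _ in range(n):
        stream.seek(rng.randrange(size))
        stream.readline()            # discard the partial line we landed in
        line = stream.readline()
        if not line:                 # landed in the last line; wrap to the start
            stream.seek(0)
            line = stream.readline()
        out.append(parse(line, counter))
    return out

def sample_by_scanning(stream, n, rng, counter):
    """EvalFunc-style sampling (hypothetical): every record is read and parsed first."""
    stream.seek(0)
    records = [parse(line, counter) for line in stream]
    return rng.sample(records, n)

# A made-up input of 10,000 tab-separated records.
data = b"".join(b"%d\tvalue%d\n" % (i, i) for i in range(10_000))
rng = random.Random(0)
seek_count, scan_count = [0], [0]
seek_sample = sample_by_seeking(io.BytesIO(data), len(data), 100, rng, seek_count)
scan_sample = sample_by_scanning(io.BytesIO(data), 100, rng, scan_count)
print(seek_count[0], scan_count[0])  # prints "100 10000"
```

Both approaches return 100 sampled records, but the scanning version parses all 10,000
records to get them. The seeking version is not perfectly uniform (longer records are hit
more often), which is acceptable for a sketch of the cost argument.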

> reducing number of MR stages with ORDER BY
> ------------------------------------------
>                 Key: PIG-791
>                 URL: https://issues.apache.org/jira/browse/PIG-791
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
> When an ORDER BY is not the only operation in a Pig script, it is done in two additional
> MR jobs. The first job samples using a sampling loader; the second does the sort. The sample
> is used to construct a partitioner that balances the data equally in the sort. The sampler
> needs to be changed to be an EvalFunc instead of a loader. This way a split can be put in the
> preceding MR job, with the main data being written out and the other part flowing to the
> sampler func, which can then write out the sample. The final MR job can then be the sort.

> This change depends on multiquery code.
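
The description above says the sample is used to construct a partitioner that balances the
data in the sort. A minimal Python sketch of that idea follows. It is not Pig's partitioner;
the function names, the reducer count of 4, and the synthetic skewed keys are assumptions for
illustration. Cut points are taken at evenly spaced quantiles of the sample, and each key is
routed to the range that contains it.

```python
import bisect
import random

def boundaries_from_sample(sample_keys, num_reducers):
    """Pick num_reducers - 1 cut points at evenly spaced sample quantiles."""
    ordered = sorted(sample_keys)
    return [ordered[len(ordered) * i // num_reducers]
            for i in range(1, num_reducers)]

def partition(key, boundaries):
    """Route a key to the reducer whose key range contains it."""
    return bisect.bisect_right(boundaries, key)

rng = random.Random(0)
keys = [rng.expovariate(1.0) for _ in range(100_000)]  # synthetic, skewed keys
sample = rng.sample(keys, 1_000)                       # stand-in for the sampling job
cuts = boundaries_from_sample(sample, 4)               # 3 cut points for 4 reducers

counts = [0] * 4
for k in keys:
    counts[partition(k, cuts)] += 1
print(counts)  # roughly equal counts per reducer despite the skew
```

Because reducer i only receives keys below those of reducer i + 1, concatenating the sorted
reducer outputs in order gives a globally sorted result.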

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
