hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
Date Tue, 20 Jan 2009 18:40:59 GMT

     [ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-460:

    Attachment: sampler.patch

Attaching patch for Amir who is currently out.  Not marking as patch available as I believe
Amir wanted to do some performance testing before declaring it ready.

> PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
> ------------------------------------------------------------
>                 Key: PIG-460
>                 URL: https://issues.apache.org/jira/browse/PIG-460
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Amir Youssefi
>             Fix For: types_branch
>         Attachments: sampler.patch
> Currently order by is done in three MR jobs:
> job 1: read data in whatever loader the user requests, store using BinStorage
> job 2: load using RandomSampleLoader, find quantiles
> job 3: load data again and sort
> It is done this way because RandomSampleLoader extends BinStorage, and so needs the data
> in that format to read it.
> If the logic in RandomSampleLoader was made into an operator instead of being in a loader
> then jobs 1 and 2 could be merged.  On average job 1 takes about 15% of the time of an order
> by script.
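
For readers unfamiliar with sample-based sorting, the idea in the description above can be sketched roughly as follows. This is only an illustration in Python, not Pig's code and not the contents of sampler.patch. All function names (reservoir_sample, find_quantiles, range_partition, order_by) are hypothetical, and reservoir sampling is an assumed stand-in for whatever RandomSampleLoader does. The point is that when sampling is an operator over a record stream rather than a loader tied to one storage format, it can run in the same pass that reads the user's data.

```python
import bisect
import random


def reservoir_sample(records, k, rng):
    """Keep a uniform random sample of up to k records in one pass."""
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def find_quantiles(sample, num_partitions):
    """Pick num_partitions - 1 cut points from the sorted sample."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def range_partition(records, cuts):
    """Send each record to the partition whose key range contains it."""
    parts = [[] for _ in range(len(cuts) + 1)]
    for rec in records:
        parts[bisect.bisect_right(cuts, rec)].append(rec)
    return parts


def order_by(records, num_partitions=4, sample_size=100, seed=0):
    """Hypothetical two-phase sort: sample while loading, then partition and sort."""
    rng = random.Random(seed)
    data = list(records)  # stands in for the user's loader
    # Phase 1 (jobs 1 and 2 merged): sample as an operator, find quantiles.
    cuts = find_quantiles(reservoir_sample(data, sample_size, rng), num_partitions)
    # Phase 2 (job 3): range-partition on the cut points, sort each partition.
    parts = range_partition(data, cuts)
    return [rec for part in parts for rec in sorted(part)]
```

Because the partitions cover disjoint, increasing key ranges, concatenating the individually sorted partitions gives a totally ordered result. The sample only affects how evenly the records are spread across partitions, not correctness.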

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
