hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
Date Thu, 25 Sep 2008 16:29:44 GMT
PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
------------------------------------------------------------

                 Key: PIG-460
                 URL: https://issues.apache.org/jira/browse/PIG-460
             Project: Pig
          Issue Type: Bug
    Affects Versions: types_branch
            Reporter: Alan Gates
            Assignee: Alan Gates
             Fix For: types_branch


Currently order by is done in three MR jobs:

job 1: read data using whatever loader the user requests, store it using BinStorage
job 2: load using RandomSampleLoader, find quantiles
job 3: load the data again and sort
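
To make the role of jobs 2 and 3 concrete, here is a minimal single-process sketch in Python of
a sample-then-sort scheme: quantiles taken from a random sample become the key-range boundaries
that send each record to a partition, and sorting each partition independently then yields a
total order. This is an illustration only, not Pig's code; the names find_quantiles,
partition_for and sample_then_sort are hypothetical, and in-memory lists stand in for MR jobs.

```python
import bisect
import random


def find_quantiles(sample, num_partitions):
    """Return num_partitions - 1 boundary keys picked from the sorted sample."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[len(ordered) * i // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Index of the partition whose key range contains `key`."""
    return bisect.bisect_right(boundaries, key)


def sample_then_sort(records, num_partitions, sample_size, seed=0):
    """Hypothetical stand-in for jobs 2 and 3, run in one process."""
    # "Job 2": draw a random sample and derive the quantile boundaries.
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    boundaries = find_quantiles(sample, num_partitions)

    # "Job 3": route every record by key range, then sort each partition.
    partitions = [[] for _ in range(num_partitions)]
    for key in records:
        partitions[partition_for(key, boundaries)].append(key)

    # Partition i only holds keys <= those in partition i + 1, so the
    # concatenation of the sorted partitions is globally sorted.
    return [key for part in partitions for key in sorted(part)]
```

The quality of the sample only affects how evenly the partitions are balanced; the result is
sorted either way.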

It is done this way because RandomSampleLoader extends BinStorage, so it can only read data
that is already in that format.

If the logic in RandomSampleLoader were moved into an operator instead of living in a loader,
then jobs 1 and 2 could be merged.  On average, job 1 takes about 15% of the running time of
an order by script.
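
As a rough sketch of what "sampling as an operator" could look like, the Python below wraps any
upstream iterator of records, passes every record through unchanged, and keeps a fixed-size
random sample on the side. It does not depend on how the records were loaded, which is the
property that would let the first two jobs be combined. The class name SampleOperator and its
parameters are hypothetical, and reservoir sampling (Algorithm R) is an assumption chosen for
the illustration; this issue does not say how RandomSampleLoader picks its sample.

```python
import random


class SampleOperator:
    """Pass-through operator that keeps a random sample of what flows through it."""

    def __init__(self, upstream, sample_size, seed=None):
        self.upstream = upstream        # any iterable of records, from any loader
        self.sample_size = sample_size
        self.sample = []                # the reservoir
        self.seen = 0                   # number of records observed so far
        self._rng = random.Random(seed)

    def __iter__(self):
        for record in self.upstream:
            self.seen += 1
            if len(self.sample) < self.sample_size:
                # Fill the reservoir with the first sample_size records.
                self.sample.append(record)
            else:
                # Algorithm R: keep the new record with probability
                # sample_size / seen, evicting a uniformly chosen slot.
                slot = self._rng.randrange(self.seen)
                if slot < self.sample_size:
                    self.sample[slot] = record
            # Always forward the record downstream unchanged.
            yield record
```

A caller would iterate over the operator once (for example while writing the intermediate
data) and read `sample` afterwards to compute the quantiles.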

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
