pig-dev mailing list archives

From "Olga Natkovich (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-350) Join optimization for pipeline rework
Date Thu, 21 Aug 2008 18:06:44 GMT

     [ https://issues.apache.org/jira/browse/PIG-350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-350:
-------------------------------

    Priority: Major  (was: Critical)

Reducing priority since so far we have not been able to figure out a way to make this code run
efficiently. We will revisit later.

> Join optimization for pipeline rework
> -------------------------------------
>
>                 Key: PIG-350
>                 URL: https://issues.apache.org/jira/browse/PIG-350
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>             Fix For: types_branch
>
>         Attachments: join.patch, join2.patch, join3.patch, join4.patch
>
>
> Currently, joins in Pig are done as groupings where each input is grouped on the join
> key.  In the reduce phase, records from each input are collected into a bag for each key,
> and then a cross product is done on these bags.  This can be optimized by selecting one
> input (hopefully the largest) and streaming through it rather than placing its records in
> a bag.  This will result in better memory usage, fewer spills to disk due to bag overflow,
> and better performance.  Ideally, the system would intelligently select which input to
> stream based on a histogram of value distributions for the keys, but Pig does not have
> that kind of metadata.  So for now it is best to always pick the same input (first or
> last) so that the user can select which input to stream.
> Similarly, order by in Pig is done in this same way, with the grouping keys being the
> ordering keys and only one input.  In this case Pig still currently collects all the
> records for a key into a bag and then flattens the bag.  This is a total waste, and in
> some cases it causes significant performance degradation.  The same optimization listed
> above can address this case: the last bag (here the only bag) is streamed rather than
> collected.
> To do these operations, a new POJoinPackage will be needed.  It will replace POPackage
> and the following POForEach in these types of scripts, handling pulling the records from
> Hadoop and streaming them into the Pig pipeline.  A visitor will need to be added in the
> map reduce compilation phase that detects this case and combines the POPackage and
> POForEach into this new POJoinPackage.
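
The streaming optimization described in the issue can be sketched as follows. This is an illustrative Python sketch, not Pig's actual Java implementation; the function names and the per-key list-of-bags input shape are assumptions made for the example. All inputs except the last are held in memory while the last (ideally the largest) input is streamed record by record; with a single input, as in the order-by case, streaming degenerates to a pass-through with no bag at all.

```python
from itertools import product

def bag_join(bags):
    """Current behavior: every input for a key is fully collected into a
    bag, then a cross product is taken over all the bags."""
    return list(product(*bags))

def streaming_join(bags):
    """Sketch of the optimized behavior: hold all inputs except the last
    in memory and stream the last input record by record, emitting
    cross-product tuples as each record arrives.  With a single input
    (the order-by case) this simply yields each record, with no bag and
    no flattening step."""
    held = [list(b) for b in bags[:-1]]   # smaller inputs stay in memory
    for record in bags[-1]:               # largest input is streamed
        for combo in product(*held):
            yield combo + (record,)
```

For the same per-key inputs, both functions produce the same set of joined tuples; the streaming variant just never materializes the last input as a bag.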

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

