hadoop-pig-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-845) PERFORMANCE: Merge Join
Date Mon, 10 Aug 2009 00:02:14 GMT

     [ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-845:
---------------------------------

    Attachment: merge-join-1.patch

Specification: http://wiki.apache.org/pig/PigMergeJoin

Updated patch with the following enhancements:

Performance:
a) Removed POForEach entirely from the generation of joined output tuples.
b) Output tuples are now created at the required size and filled with set instead of append.
c) The key is now cached, as Pradeep suggested in a previous comment.
d) A new ArrayList is created to hold buffered left tuples instead of clearing the existing
one, which avoids resizing the array.
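
As a rough illustration of items b) to d), here is a minimal Python sketch of a merge join over two inputs that are already sorted on the join key. It is not the code from the patch (Pig itself is written in Java). The function name merge_join and the key_index parameter are made up for this example, and rows are assumed to be plain tuples with the join key at the same position on both sides.

```python
from typing import Iterable, Iterator, Tuple


def merge_join(left: Iterable[Tuple], right: Iterable[Tuple],
               key_index: int = 0) -> Iterator[Tuple]:
    """Inner-join two inputs that are both sorted ascending on the join key.

    Hypothetical helper for illustration only; not Pig's implementation.
    """
    left_iter, right_iter = iter(left), iter(right)
    left_row = next(left_iter, None)
    right_row = next(right_iter, None)

    while left_row is not None and right_row is not None:
        # Read each key once and reuse it below (compare item c).
        left_key = left_row[key_index]
        right_key = right_row[key_index]

        if left_key < right_key:
            left_row = next(left_iter, None)
        elif left_key > right_key:
            right_row = next(right_iter, None)
        else:
            # Start a fresh buffer for this key group rather than
            # clearing a shared one (compare item d).
            buffered = []
            while left_row is not None and left_row[key_index] == left_key:
                buffered.append(left_row)
                left_row = next(left_iter, None)

            # Stream the matching right rows against the buffered left rows.
            while right_row is not None and right_row[key_index] == left_key:
                for buffered_row in buffered:
                    # Allocate the output at its final size and fill it by
                    # position instead of appending (compare item b).
                    out = [None] * (len(buffered_row) + len(right_row))
                    out[:len(buffered_row)] = buffered_row
                    out[len(buffered_row):] = right_row
                    yield tuple(out)
                right_row = next(right_iter, None)
```

Only the left rows for the current key are held in memory; the right side is streamed. In Python the allocation choices make little practical difference, so the sketch shows the shape of the technique rather than its performance benefit.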
  
Functionality:
a) Added typecasting for index keys, thus making join work when schemas are supplied.
b) Refactored visit(LOJoin loj) in LogToPhyTranslationVisitor to avoid duplicate code.

Error Handling:
a) Better error handling at various places.
b) Added validateMergeJoin() in LogToPhyTranslationVisitor to throw an exception where Merge
Join can't be used.
c) Added more tests.

Limitations:
Merge Join doesn't work when splits, streaming, or order-by appear in its predecessors, or when
streaming is present in its successors.
Some of these limitations are related to the issue outlined in https://issues.apache.org/jira/browse/PIG-858
and require work in MRCompiler.
Currently we detect these conditions in validateMergeJoin() and fail at compile time.
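
To make the kind of check concrete, here is a hedged Python sketch of a plan validation that rejects the cases listed above. It does not reproduce validateMergeJoin() from the patch. The plan representation (dicts of node names), the operator kind strings, and the exception name are all invented for illustration, and the sketch assumes the check applies to transitive predecessors and successors, which the description above does not state.

```python
from typing import Dict, List, Set

# Hypothetical operator kind labels; Pig's real operator classes differ.
BLOCKED_BEFORE_JOIN = {"split", "stream", "order_by"}
BLOCKED_AFTER_JOIN = {"stream"}


class MergeJoinNotApplicable(Exception):
    """Raised at plan-compile time when merge join cannot be used."""


def _reachable(start: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Return every node reachable from start by following edges."""
    seen: Set[str] = set()
    stack = list(edges.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen


def validate_merge_join(join: str,
                        kinds: Dict[str, str],
                        predecessors: Dict[str, List[str]],
                        successors: Dict[str, List[str]]) -> None:
    """Fail fast if an unsupported operator surrounds the join node."""
    for node in _reachable(join, predecessors):
        if kinds[node] in BLOCKED_BEFORE_JOIN:
            raise MergeJoinNotApplicable(
                f"{kinds[node]} operator '{node}' precedes the merge join")
    for node in _reachable(join, successors):
        if kinds[node] in BLOCKED_AFTER_JOIN:
            raise MergeJoinNotApplicable(
                f"{kinds[node]} operator '{node}' follows the merge join")
```

Raising during validation, before any job is launched, mirrors the compile-time failure described above.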

> PERFORMANCE: Merge Join
> -----------------------
>
>                 Key: PIG-845
>                 URL: https://issues.apache.org/jira/browse/PIG-845
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>         Attachments: merge-join-1.patch, merge-join-for-review.patch
>
>
> This join would work if the data for both tables is sorted on the join key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

