hadoop-pig-dev mailing list archives

From "Dmitriy V. Ryaboy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
Date Thu, 10 Sep 2009 21:34:57 GMT

    [ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753828#action_12753828 ]

Dmitriy V. Ryaboy commented on PIG-953:

bq. Pradeep: Pig only guarantees order with limit following order - for any other relational
operator following order there are no guarantees. Today it is true that filter or a column
pruning foreach would also preserve order but this can change if needed in the future. There
is explicit code to ensure the order-limit combination works by preserving order - there is no such
explicit check for other operators (keeping it open for change in the future)

That actually tells me that an orderPreserving property on a LogicalOperator is a really good
idea. That way we can set it to true on all operators that are at the moment order-preserving (limit,
filter, column-pruning foreach), and not commit to forever maintaining that contract. If filter
starts changing order, the patch will simply have to include a change to set orderPreserving
to false in POFilter, and everything will work automagically.
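
To make the suggestion concrete, here is a rough sketch of what such a flag could look like. None of these classes are Pig's real ones: SketchOperator, the Sketch* subclasses and OrderCheck are hypothetical names made up for illustration, and marking group as non-preserving is only an assumed example of an operator that would set the flag to false.

```java
import java.util.List;

// Hypothetical stand-in for a logical operator; NOT Pig's LogicalOperator.
abstract class SketchOperator {
    private final String name;
    private final boolean orderPreserving;

    SketchOperator(String name, boolean orderPreserving) {
        this.name = name;
        this.orderPreserving = orderPreserving;
    }

    String getName() {
        return name;
    }

    // True if this operator currently emits tuples in the order it receives them.
    boolean isOrderPreserving() {
        return orderPreserving;
    }
}

// Operators the comment lists as order-preserving at the moment.
final class SketchLimit extends SketchOperator {
    SketchLimit() { super("limit", true); }
}

final class SketchFilter extends SketchOperator {
    SketchFilter() { super("filter", true); }
}

final class SketchPruningForeach extends SketchOperator {
    SketchPruningForeach() { super("column-pruning foreach", true); }
}

// Assumed example of an operator that makes no ordering promise.
final class SketchGroup extends SketchOperator {
    SketchGroup() { super("group", false); }
}

final class OrderCheck {
    private OrderCheck() { }

    // The sort order survives only if every operator between the order-by
    // and the merge join reports itself as order-preserving.
    static boolean sortOrderSurvives(List<? extends SketchOperator> pipeline) {
        for (SketchOperator op : pipeline) {
            if (!op.isOrderPreserving()) {
                return false;
            }
        }
        return true;
    }
}
```

With this shape, a patch that makes filter reorder its output would only need to flip the flag passed in the SketchFilter constructor; the check itself would not change.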

> Enable merge join in pig to work with loaders and store functions which can internally index sorted data
> ---------------------------------------------------------------------------------------------------------
>                 Key: PIG-953
>                 URL: https://issues.apache.org/jira/browse/PIG-953
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: PIG-953.patch
> Currently, the merge join implementation in Pig constructs an index on sorted
> data and uses that index to seek into the "right input" to perform the join
> efficiently. Some loaders (notably the zebra loader) internally implement an index on sorted
> data and can perform this seek efficiently using their own index. So the use of the index needs
> to be abstracted in such a way that when the loader supports indexing, Pig uses it (indirectly
> through the loader) and does not construct an index.

