hadoop-pig-dev mailing list archives

From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
Date Sun, 27 Sep 2009 22:16:16 GMT

    [ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760116#action_12760116 ]

Ashutosh Chauhan commented on PIG-953:

Changes look good. Couple of points:

bq. I think this internal structure at this point does not need to be optimized for lookup

Well, it's less about optimization and more about maintainability. First, the relationship between
the two parallel arrays is implicit, so anyone reading that code has to work out
that relationship on their own. With a single structure, the relationship would be explicit.
Second, there is quite a bit of code around it which, IMO, would be simplified if a single
data structure were used instead. That said, either approach works, so I will leave
it up to you.
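
To illustrate the maintainability point, here is a rough sketch in Java. It is not taken from the patch: the class names, the key/value types and the lookup methods are all hypothetical, and only the contrast between parallel arrays and a single structure is meant to carry over.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the code in PIG-953.patch.
// Parallel arrays: keys[i] belongs with values[i] only by convention,
// so every reader and every caller has to know and preserve that pairing.
final class ParallelArraysSketch {
    static int lookup(String[] keys, int[] values, String wanted) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].equals(wanted)) {
                return values[i];
            }
        }
        return -1; // not found
    }
}

// Single structure: the pairing is stated by the type itself,
// and the lookup loop disappears.
final class SingleStructureSketch {
    private final Map<String, Integer> valueByKey = new LinkedHashMap<>();

    void add(String key, int value) {
        valueByKey.put(key, value);
    }

    int lookup(String wanted) {
        return valueByKey.getOrDefault(wanted, -1); // -1 when not found
    }
}
```

Both variants answer the same question; the second simply removes the bookkeeping needed to keep two arrays in step.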

bq. Zebra needs column names and cannot work with positions

That is then a limitation of Zebra, which it should overcome at some point. There
might be a good reason for it, but I fail to see what extra information column names provide
when the type and position of columns should be sufficient. This also implies an additional requirement
on the user: if data is stored using ZebraStorage and later loaded back, the user has to
provide the same column names that were given when storing it. No such constraint exists
for any other load/store function, such as PigStorage.

> Enable merge join in pig to work with loaders and store functions which can internally index sorted data
> ---------------------------------------------------------------------------------------------------------
>                 Key: PIG-953
>                 URL: https://issues.apache.org/jira/browse/PIG-953
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: PIG-953-2.patch, PIG-953.patch
> Currently merge join implementation in pig includes construction of an index on sorted
> data and use of that index to seek into the "right input" to efficiently perform the join
> operation. Some loaders (notably the zebra loader) internally implement an index on sorted
> data and can perform this seek efficiently using their index. So the use of the index needs
> to be abstracted in such a way that when the loader supports indexing, pig uses it (indirectly
> through the loader) and does not construct an index.
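
The abstraction described in the issue could look roughly like the following Java sketch. It is an assumption-laden illustration, not Pig's actual API: `SortedInput`, `SelfIndexedInput` and `MergeJoinSeekSketch` are hypothetical names, keys are plain strings, and a binary search over an in-memory array stands in for the index that Pig would otherwise build.

```java
import java.util.Arrays;

// Hypothetical: any sorted source for the "right input" of the join.
interface SortedInput {
    String[] sortedKeys();
}

// Hypothetical: a sorted source that maintains its own index and can seek with it.
interface SelfIndexedInput extends SortedInput {
    // Returns the offset at which reading should resume for the given key.
    int seek(String key);
}

final class MergeJoinSeekSketch {
    // Delegates to the loader's own index when it has one; otherwise falls
    // back to a framework-side lookup (here, a binary search).
    static int seek(SortedInput right, String key) {
        if (right instanceof SelfIndexedInput) {
            return ((SelfIndexedInput) right).seek(key);
        }
        int pos = Arrays.binarySearch(right.sortedKeys(), key);
        // A negative result encodes the insertion point as -(point + 1).
        return pos >= 0 ? pos : -(pos + 1);
    }
}
```

The join code only ever calls `MergeJoinSeekSketch.seek`, so whether an index is built by the framework or supplied by the loader stays hidden behind one call.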

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
