hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-845) PERFORMANCE: Merge Join
Date Wed, 12 Aug 2009 07:09:14 GMT

    [ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742245#action_12742245 ]

Alan Gates commented on PIG-845:

Dmitry wrote> Would it make sense to expose this to the users via a 'CREATE INDEX' (or
similar) command?
That way the index could be persisted, and the user could tell you to use an existing index
instead of rescanning the data.

Ashutosh wrote> If we allow that then we also need to deal with managing and persisting
the index. Once Owl is integrated, we could make use of that to do all this for Pig. Till
then, we can continue creating index every time and as I said overhead of index creation is
negligible as compared to run times of actual joins.

My thinking was that at some future point, Pig would automatically cache this sample the first
time it creates it, so that subsequent joins on the same data set could make use of it without
resampling.  I'm hoping we can use Owl for that, as Ashutosh indicated.


Dmitry wrote> I am not sure about the approach of pushing sampling above filters. Have
you guys benchmarked this? Seems like you'd wind up reading the whole file in the sample job
if the filter is selective enough (and high filter selectivity would also make materialize->sample
go much faster).

You want to build your index on the pre-filtered data because your index is telling you what
block to look for the data in.  The fact that the filter may have removed that record doesn't
matter.  It will either be in the block indicated in the index or not present.  Also, you
want to avoid filtering and then building the index because it adds another write and read
of the data (you have to filter, write the data to HDFS, then read it to build the index,
then read it again to do the join).
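
To illustrate why an index built on the pre-filtered data stays valid, here is a minimal
sketch in Python. It is not Pig's implementation: the names build_index and lookup, the
in-memory "blocks", and the keep predicate are all hypothetical stand-ins for sorted HDFS
blocks and a filter applied at read time.

```python
from bisect import bisect_left

def build_index(blocks):
    """Sparse index: the first join key of each sorted block (unfiltered data)."""
    return [block[0][0] for block in blocks]

def lookup(blocks, index, key, keep=lambda rec: True):
    """Find records matching `key`, applying the filter only while reading.

    The index points at the block where `key` would start; the filter may
    drop the record, in which case nothing is returned for that key.
    """
    # Start one block before the first block whose first key is >= key,
    # because a run of equal keys can begin at the end of the previous block.
    start = max(bisect_left(index, key) - 1, 0)
    matches = []
    for block in blocks[start:]:
        for rec in block:
            if rec[0] > key:
                return matches  # sorted input: no later record can match
            if rec[0] == key and keep(rec):
                matches.append(rec)
    return matches

# Hypothetical sorted input, split into three blocks of (key, value) records.
blocks = [
    [(1, "a"), (2, "b")],
    [(3, "c"), (4, "d")],
    [(5, "e"), (6, "f")],
]
index = build_index(blocks)

print(lookup(blocks, index, 4))
print(lookup(blocks, index, 4, keep=lambda rec: rec[1] != "d"))
```

In this sketch the index is computed once from the raw blocks, and the filter runs only
inside lookup, so no filtered copy of the data is ever written out and read back.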

> -----------------------
>                 Key: PIG-845
>                 URL: https://issues.apache.org/jira/browse/PIG-845
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Ashutosh Chauhan
>         Attachments: merge-join-1.patch, merge-join-for-review.patch
> This join would work if the data for both tables is sorted on the join key.

