hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-930) merge join should handle compressed bz2 sorted files
Date Tue, 25 Aug 2009 01:33:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747172#action_12747172 ]

Pradeep Kamath commented on PIG-930:

A couple of proposals:

1) We record (key, startOfBlockOffset, filename) in the index. In the join map job we then
always start at the beginning of the block in the right file and scan forward for the key,
which we should find. Unfortunately, before reaching the indexed key we may read many tuples
that do not survive the pipeline.

2) We record (key, startOfBlockOffset, pos, filename) in the index. We then use startOfBlockOffset
to bindTo() in the right file and repeatedly call getPosition() and getNext() until getPosition()
== pos. At that point the tuple returned by the next getNext() would be the right tuple, with the
key we want.

Approach 1) seems safer, given that getPosition() is a little inaccurate in the bz2 case.
It may carry a performance penalty, though. We may want to go with this approach and optimize
later if needed. Not sure whether this will have implications for outer join.
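
The two proposals can be sketched roughly as follows. This is a toy model, not Pig code: the
right-side file is a Python list of (key, payload) rows, list indices stand in for byte offsets,
a predicate named keep stands in for the right-side pipeline, and IndexEntry, build_index,
seek_proposal_1 and seek_proposal_2 are hypothetical names invented for illustration.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple

Row = Tuple[int, str]  # (join key, payload); hypothetical record shape


class IndexEntry(NamedTuple):
    key: int
    block_start: int  # stands in for startOfBlockOffset
    pos: int          # offset of the row holding `key` (only proposal 2 uses it)
    filename: str


def build_index(rows: List[Row], block_size: int,
                keep: Callable[[Row], bool], filename: str) -> List[IndexEntry]:
    """One entry per block: the first row in the block that survives `keep`."""
    index = []
    for block_start in range(0, len(rows), block_size):
        block_end = min(block_start + block_size, len(rows))
        for pos in range(block_start, block_end):
            if keep(rows[pos]):
                index.append(IndexEntry(rows[pos][0], block_start, pos, filename))
                break
    return index


def seek_proposal_1(rows: List[Row], entry: IndexEntry,
                    keep: Callable[[Row], bool]) -> Tuple[Optional[Row], int]:
    """Start at the block start and run rows through the pipeline until the key shows up.

    Returns the matching row and the number of rows read and thrown away first.
    """
    discarded = 0
    for pos in range(entry.block_start, len(rows)):
        row = rows[pos]
        if keep(row) and row[0] == entry.key:
            return row, discarded
        discarded += 1
    return None, discarded


def seek_proposal_2(rows: List[Row], entry: IndexEntry) -> Row:
    """Start at the block start and skip rows, bypassing the pipeline, until position == pos."""
    position = entry.block_start
    while position != entry.pos:
        position += 1  # stands in for getNext() advancing getPosition()
    return rows[position]
```

In this model proposal 2 depends on the reported position hitting pos exactly, which is the
part that is doubtful for bz2; proposal 1 only depends on the block start and pays for it with
the discarded rows.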

> merge join should handle compressed bz2 sorted files
> ----------------------------------------------------
>                 Key: PIG-930
>                 URL: https://issues.apache.org/jira/browse/PIG-930
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Pradeep Kamath
> There are two issues. POLoad, which is used to read the right side input, does not handle
> bz2 files right now. This needs to be fixed.
> Further, in the index map job we bindTo(startOfBlockOffset) (this will internally discard
> the first tuple if offset > 0). Then we do the following:
> {noformat}
> while (tuple survives pipeline) {
>   pos = getPosition()
>   getNext()
>   run the tuple through the pipeline on the right side, which could have a filter
> }
> emit(key, pos, filename)
> {noformat}
> Then in the map job which does the join, we bindTo(pos > 0 ? pos - 1 : pos) (we use
> pos - 1 because bindTo will discard the first tuple for pos > 0). Then we do getNext().
> Now in bz2 compressed files, getPosition() returns a position which is not really accurate.
> The problem is that it could be a position in the middle of a compressed bz2 block. Then, when we
> use that position to bindTo() in the final map job, the code would first hunt for a bz2 block
> header, thus skipping the whole current bz2 block.
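
The skipped-block problem described above can be illustrated with a small model. This is only
a sketch under stated assumptions: block_starts is a made-up sorted list of offsets at which
bz2 block headers begin, and bind_to is a hypothetical stand-in for a reader that, given an
offset, resumes at the first block header at or after it. It is not the actual Pig or Hadoop
implementation.

```python
from bisect import bisect_left
from typing import List, Optional


def bind_to(block_starts: List[int], offset: int) -> Optional[int]:
    """Return the block start a reader would resume from after binding to `offset`.

    Models a reader that hunts forward for the next block header at or after
    `offset`; returns None when no further block exists.
    """
    i = bisect_left(block_starts, offset)
    return block_starts[i] if i < len(block_starts) else None
```

With blocks starting at 0, 1000 and 2000, binding to 1000 resumes at 1000, but binding to a
mid-block position such as 1400 resumes at 2000, so everything left in the block that starts
at 1000 is never read. Recording the block start in the index instead of a mid-block position
avoids that.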

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
