hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-930) merge join should handle compressed bz2 sorted files
Date Thu, 27 Aug 2009 17:34:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748468#action_12748468 ]

Pradeep Kamath commented on PIG-930:

I had spoken to Ben (who wrote the bzip2 code), and the position returned from getPosition()
starts off as an offset into the compressed bz2 file and then becomes a counter on the
uncompressed stream. So it is inaccurate: it is a position on neither the compressed nor the
uncompressed stream, but a best-effort in-between value. Also, with compression a single
compressed byte could correspond to multiple uncompressed bytes or vice versa. So getting an
accurate position in the data, so that we can get the very next tuple, would be difficult.
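
To illustrate why compressed and uncompressed offsets cannot be mapped byte for byte, here is a small sketch using Python's standard bz2 module. It is not Pig's bzip2 code, and the sample inputs are made up for the illustration; it only shows that the size ratio can go either way.

```python
import bz2

# Highly repetitive input: many uncompressed bytes map to few compressed bytes.
repetitive = b"a" * 100_000
compressed_repetitive = bz2.compress(repetitive)

# Tiny input: the bz2 stream overhead makes the compressed form larger.
tiny = b"a"
compressed_tiny = bz2.compress(tiny)

print(len(repetitive), "->", len(compressed_repetitive))
print(len(tiny), "->", len(compressed_tiny))
```

Because the ratio varies with the data, an offset counted on one stream says little about the matching offset on the other.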

Besides this, I think the fact that we do bindTo(pos > 0 ? pos - 1 : pos) (we do pos - 1
because bindTo will discard the first tuple for pos > 0) is not very clean. We cannot always
assume that 1 byte less than the position suggested by the index is the right position to
bindTo so that we correctly get to the tuple in the index. (For example, if the delimiter is
multi-byte, the loader may discard the tuple we want to get to!) Approach 2) outlined above
will avoid this hack, since we will bind to startOfDfsBlock and then do getPosition() and getNext()
repeatedly till we reach the position suggested in the index. The next getNext() should give
us the exact same key as in the index, since the index creation code follows the same sequence
of bindTo() -> getPosition() -> getNext().
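
A minimal sketch of the two seek strategies, written in Python against a toy in-memory loader. ToyLoader, build_index_entry, seek_with_pos_minus_one and seek_by_replay are hypothetical names made up for this illustration and are not Pig's loader API; the toy works on uncompressed bytes and only models the "bindTo discards the first tuple for pos > 0" behaviour described above.

```python
class ToyLoader:
    """Hypothetical stand-in for a delimiter-based loader (not Pig's API)."""

    def __init__(self, data, delim=b"\n"):
        self.data = data
        self.delim = delim
        self.pos = 0

    def bind_to(self, offset):
        self.pos = offset
        if offset > 0:
            # Assumed behaviour: discard the first (possibly partial) tuple.
            self.get_next()

    def get_position(self):
        return self.pos

    def get_next(self):
        if self.pos >= len(self.data):
            return None
        end = self.data.find(self.delim, self.pos)
        if end == -1:
            record, self.pos = self.data[self.pos:], len(self.data)
        else:
            record, self.pos = self.data[self.pos:end], end + len(self.delim)
        return record


def build_index_entry(loader, block_start, survives):
    """Index side: bindTo -> getPosition -> getNext until a tuple survives."""
    loader.bind_to(block_start)
    while True:
        pos = loader.get_position()
        record = loader.get_next()
        if record is None:
            return None
        if survives(record):
            return record, pos


def seek_with_pos_minus_one(loader, index_pos):
    """The current hack: bind one byte early and rely on the discard."""
    loader.bind_to(index_pos - 1 if index_pos > 0 else index_pos)
    return loader.get_next()


def seek_by_replay(loader, block_start, index_pos):
    """Approach 2: replay the index sequence from the start of the block."""
    loader.bind_to(block_start)
    while loader.get_position() != index_pos:
        if loader.get_next() is None:
            raise ValueError("index position was never reached")
    return loader.get_next()


data = b"a||b||c||d"  # two-byte delimiter
key, pos = build_index_entry(ToyLoader(data, b"||"), 0, lambda r: r == b"c")
print(seek_with_pos_minus_one(ToyLoader(data, b"||"), pos))  # skips past "c"
print(seek_by_replay(ToyLoader(data, b"||"), 0, pos))        # lands on "c"
```

In this toy, binding at pos - 1 with a two-byte delimiter lands inside the delimiter, so the discarded "first tuple" swallows the indexed record. Replaying from the block start does not depend on the delimiter width, at the cost of reading the tuples before the indexed one.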

> merge join should handle compressed bz2 sorted files
> ----------------------------------------------------
>                 Key: PIG-930
>                 URL: https://issues.apache.org/jira/browse/PIG-930
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Pradeep Kamath
> There are two issues. First, POLoad, which is used to read the right side input, does not handle
bz2 files right now. This needs to be fixed.
> Further, in the index map job we bindTo(startOfBlockOffSet) (this will internally discard
the first tuple if offset > 0). Then we do the following:
> {noformat}
> while (tuple survives pipeline) {
>   pos = getPosition()
>   getNext()
>   run the tuple through the pipeline on the right side, which could have a filter
> }
> emit(key, pos, filename)
> {noformat}
> Then in the map job which does the join, we bindTo(pos > 0 ? pos - 1 : pos) (we do
pos - 1 because bindTo will discard the first tuple for pos > 0). Then we do getNext().
> Now in bz2 compressed files, getPosition() returns a position which is not really accurate.
The problem is it could be a position in the middle of a compressed bz2 block. Then when we
use that position to bindTo() in the final map job, the code would first hunt for a bz2 block
header thus skipping the whole current bz2 block. 
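
The skipping behaviour described above can be pictured with a toy model. This is a sketch under stated assumptions, not the actual bzip2 reader: block_starts is a made-up list of compressed offsets at which bz2 block headers begin, and first_block_header_at_or_after is a hypothetical helper that mimics "hunt forward for the next block header".

```python
import bisect

def first_block_header_at_or_after(block_starts, pos):
    """Return the offset of the first block header at or after pos, or None."""
    i = bisect.bisect_left(block_starts, pos)
    return block_starts[i] if i < len(block_starts) else None

# Made-up compressed offsets of three bz2 block headers.
block_starts = [0, 900, 1800]

# A position in the middle of the first block resolves to the second block,
# so every tuple between offset 450 and the end of the first block is skipped.
print(first_block_header_at_or_after(block_starts, 450))

# A position exactly on a block header stays where it is.
print(first_block_header_at_or_after(block_starts, 900))
```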

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
