hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-930) merge join should handle compressed bz2 sorted files
Date Thu, 27 Aug 2009 17:38:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748469#action_12748469
] 

Pradeep Kamath commented on PIG-930:
------------------------------------

Not sure why JIRA shows part of the above comment struck out. I meant the entire text to be
part of the comment.

> merge join should handle compressed bz2 sorted files
> ----------------------------------------------------
>
>                 Key: PIG-930
>                 URL: https://issues.apache.org/jira/browse/PIG-930
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Pradeep Kamath
>
> There are two issues. First, POLoad, which is used to read the right-side input, does not
> handle bz2 files right now. This needs to be fixed.
> Further, in the index map job we bindTo(startOfBlockOffSet) (this will internally discard
> the first tuple if offset > 0). Then we do the following:
> {noformat}
> While(tuple survives pipeline) {
>   Pos =  getPosition()
>   getNext() 
>   run the tuple  through pipeline in the right side which could have filter
> }
> Emit(key, pos, filename).
> {noformat}
>  
> Then in the map job which does the join, we bindTo(pos > 0 ? pos - 1 : pos) (we do
> pos - 1 because bindTo will discard the first tuple for pos > 0). Then we do getNext().
> Now in bz2 compressed files, getPosition() returns a position which is not really accurate.
> The problem is that it could be a position in the middle of a compressed bz2 block. Then, when
> we use that position to bindTo() in the final map job, the code would first hunt for a bz2
> block header, thus skipping the whole current bz2 block.
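
The skipping behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Pig's loader code: the names make_blocks and read_from and the fixed BLOCK_SIZE are hypothetical, a "position" is simply a tuple index, and the extra first-tuple discard done by bindTo is left out to keep the example small.

```python
# Toy model (hypothetical, not Pig code): a compressed file is a list of
# blocks, and decoding can only start at a block header.

BLOCK_SIZE = 3  # assumed number of tuples per compressed block


def make_blocks(tuples, block_size=BLOCK_SIZE):
    """Split a sorted list of tuples into fixed-size 'compressed' blocks."""
    return [tuples[i:i + block_size] for i in range(0, len(tuples), block_size)]


def read_from(blocks, pos, block_size=BLOCK_SIZE):
    """Return the tuples a reader would see after binding to position pos.

    If pos falls inside a block, the reader hunts for the next block
    header, so the remaining tuples of the current block are never seen.
    """
    block_index, within = divmod(pos, block_size)
    if within != 0:
        block_index += 1  # skip ahead to the next block header
    return [t for block in blocks[block_index:] for t in block]


blocks = make_blocks(list("abcdefghi"))

# Position 3 is exactly a block boundary: reading resumes at "d" as expected.
print(read_from(blocks, 3))

# Position 4 is mid-block: the reader jumps to the next header, so "e" and
# "f" are silently lost even though the index pointed at "e".
print(read_from(blocks, 4))
```

In this model, an index entry that records a mid-block position makes the join job miss tuples, which is the failure the issue describes for bz2 input.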

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

