hadoop-mapreduce-issues mailing list archives

From "Greg Roelofs (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1795) add error option if file-based record-readers fail to consume all input (e.g., concatenated gzip, bzip2)
Date Thu, 27 May 2010 02:49:38 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872068#action_12872068 ]

Greg Roelofs commented on MAPREDUCE-1795:

It appears that the initial target location for the fix, LineRecordReader's next() method
(0.20.x) or nextKeyValue() (trunk), isn't actually workable due to buffering.  Ideally one
would be able to check getFilePosition() after hitting the end of the first member/zlib-stream,
notice that it's not equal to the end of file, and optionally throw an error.  However, because
the compressed input is read in buffer-sized chunks, the file position is, in general, already
beyond the end of the zlib-stream, and for small concatenated inputs it may actually be at the
end of file even though the logical offset isn't.  There doesn't appear to be a way to get at
the logical "stream offset" at this level, though if anyone is aware of a way, please let me know.
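
To make the buffering problem concrete, here is a minimal sketch using plain java.util.zip
rather than Hadoop's codec classes.  The class and method names (StreamOffsetSketch,
zlibStream, unconsumedAfterFirstStream) are made up for illustration and are not Hadoop APIs.
The whole input is handed to the Inflater at once, the way a buffered reader would, so the
"file position" is already at end of file when the first zlib-stream ends; only the
decompressor itself knows how many buffered bytes it has not yet consumed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class StreamOffsetSketch {

    /** Compresses the text into one complete zlib-stream (hypothetical helper). */
    static byte[] zlibStream(String text) {
        Deflater deflater = new Deflater();
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Concatenates two byte arrays, mimicking "cat a.z b.z". */
    static byte[] concat(byte[] a, byte[] b) {
        byte[] both = new byte[a.length + b.length];
        System.arraycopy(a, 0, both, 0, a.length);
        System.arraycopy(b, 0, both, a.length, b.length);
        return both;
    }

    /**
     * Feeds the entire buffer to the Inflater and returns the number of
     * buffered bytes still unconsumed when the first zlib-stream ends.
     */
    static int unconsumedAfterFirstStream(byte[] buffered) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(buffered);
        byte[] out = new byte[256];
        while (!inflater.finished()) {
            int n = inflater.inflate(out);
            if (n == 0 && inflater.needsInput()) {
                break; // input ended before the stream did
            }
        }
        int remaining = inflater.getRemaining();
        inflater.end();
        return remaining;
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] first = zlibStream("first\n");
        byte[] both = concat(first, zlibStream("second\n"));

        int filePosition = both.length; // everything is already in the buffer
        int logicalOffset = both.length - unconsumedAfterFirstStream(both);

        System.out.println("file position:  " + filePosition);
        System.out.println("logical offset: " + logicalOffset);
        System.out.println("first stream:   " + first.length + " bytes");
    }
}
```

In this sketch the logical offset is recoverable only because the caller holds the Inflater
directly; the point above is that nothing comparable seems to be exposed at the record-reader
level.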

In the meantime, we're planning to simply fix the bug (i.e., MAPREDUCE-469), at least for
the native-zlib codec.  A workaround for the Java-zlib alternative is in the 30-AUG-2006 comment
on Sun's bug 4691425 (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425), but without
any explicit license that would allow us to redistribute it as part of Hadoop.  And bzip2
reportedly is already fixed on the trunk (HADOOP-4012).

Barring any new information, I plan to resolve this issue as invalid.

> add error option if file-based record-readers fail to consume all input (e.g., concatenated gzip, bzip2)
> --------------------------------------------------------------------------------------------------------
>                 Key: MAPREDUCE-1795
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1795
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Greg Roelofs
>            Assignee: Greg Roelofs
> When running MapReduce with concatenated gzip files as input, only the first part ("member"
> in gzip spec parlance, http://www.ietf.org/rfc/rfc1952.txt) is read; the remainder is silently
> ignored.  As a first step toward fixing that, this issue will add a configurable option to
> throw an error in such cases.
> MAPREDUCE-469 is the tracker for the more complete fix/feature, whenever that occurs.
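
For illustration only, the behavior described in the issue summary above might look roughly
like the following outside of Hadoop.  This is a sketch under stated assumptions, not the
proposed patch: FirstMemberOnlyReader, gzipMember, readFirstMember and the failOnUnreadInput
flag are hypothetical names, and the reader assumes the 10-byte gzip header without optional
fields that java.util.zip.GZIPOutputStream writes.  It decodes only the first member, as the
affected codec does, and optionally throws if input remains after that member's 8-byte
trailer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

public class FirstMemberOnlyReader {

    /** Compresses the text into a single gzip member (hypothetical helper). */
    static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    /**
     * Decodes only the first gzip member. If failOnUnreadInput is set
     * (a hypothetical option name), leftover input is reported as an error
     * instead of being silently ignored.
     */
    static String readFirstMember(byte[] file, boolean failOnUnreadInput)
            throws IOException, DataFormatException {
        final int headerLen = 10; // assumption: no optional header fields
        final int trailerLen = 8; // CRC32 + ISIZE, per RFC 1952

        Inflater inflater = new Inflater(true); // raw deflate, no zlib wrapper
        inflater.setInput(file, headerLen, file.length - headerLen);

        ByteArrayOutputStream text = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0 && inflater.needsInput()) {
                throw new IOException("truncated gzip member");
            }
            text.write(buf, 0, n);
        }

        // Bytes left in the buffer after the first member's trailer.
        int unread = inflater.getRemaining() - trailerLen;
        inflater.end();

        if (failOnUnreadInput && unread > 0) {
            throw new IOException(unread + " bytes of input remain after the first gzip member");
        }
        return new String(text.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With the flag off, a concatenated file yields only the first member's text and the rest is
dropped without notice; with it on, the same file raises an IOException.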

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
