hadoop-common-dev mailing list archives

From "Abdul Qadeer (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-4010) Changing LineRecordReader algo so that it does not need to skip backwards in the stream
Date Sat, 23 Aug 2008 00:40:44 GMT
Changing LineRecordReader algo so that it does not need to skip backwards in the stream
---------------------------------------------------------------------------------------

                 Key: HADOOP-4010
                 URL: https://issues.apache.org/jira/browse/HADOOP-4010
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.19.0
            Reporter: Abdul Qadeer
            Assignee: Abdul Qadeer
             Fix For: 0.19.0


The current LineRecordReader algorithm needs to move backwards in the stream (in its
constructor) to position itself correctly.  It moves back one byte from the start of its
split, tries to read a record (i.e. a line), and throws that record away, because it is
sure that the line will be handled by some other mapper.  This algorithm is difficult and
inefficient to use with a compressed stream, where data reaches the LineRecordReader
via a codec.  (In the current implementation, Hadoop does not split a compressed file; it
makes a single split from the start to the end of the file, so only one mapper handles it.
We are currently working on BZip2 codecs that make splitting possible with Hadoop, so the
proposed change will make it possible to handle plain and compressed streams uniformly.)

In the new algorithm, each mapper always skips the first line of its split, because it is
sure that this line will have been read by some other mapper (the mapper whose split starts
at the beginning of the file is the natural exception, since no earlier split exists).  In
return, each mapper must finish reading at a record boundary that is always beyond its
upper split limit.  With this change, LineRecordReader does not need to move backwards in
the stream.
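
The forward-only rule described above can be sketched in a few lines of Java.  This is
only an illustration under stated assumptions, not the actual patch: the class name
ForwardOnlySplitReader and the methods skipLine and readSplit are hypothetical, the input
is an in-memory byte array rather than a Hadoop stream, lines end with '\n', and a split
is the byte range [start, end).  It also assumes that the split starting at offset 0
keeps its first line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the proposed algorithm; not Hadoop's LineRecordReader.
public class ForwardOnlySplitReader {

    // Returns the index just past the next '\n' at or after pos, or data.length at EOF.
    static int skipLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    // Reads the lines owned by the split [start, end) without ever moving backwards.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        // Assumption: only the split at offset 0 keeps its first line. Every other
        // split discards it, because the previous split reads past its own end.
        if (start != 0) {
            pos = skipLine(data, pos);
        }
        // Note "<=": a line that begins exactly at 'end' still belongs to this split,
        // which is why the next split can skip its first line unconditionally.
        while (pos <= end && pos < data.length) {
            int next = skipLine(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        int[][] splits = {{0, 10}, {10, 20}, {20, data.length}};
        for (int[] s : splits) {
            System.out.println(s[0] + "-" + s[1] + ": " + readSplit(data, s[0], s[1]));
        }
    }
}
```

In this example every line is returned by exactly one split, even though the split
boundaries fall in the middle of a line or exactly on the start of one.  The only
positioning the reader needs is its own split start, which is what makes the approach
attractive for data arriving through a codec.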


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

