hadoop-common-dev mailing list archives

From "Abdul Qadeer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4010) Changing LineRecordReader algo so that it does not need to skip backwards in the stream
Date Fri, 19 Sep 2008 01:46:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632486#action_12632486 ]

Abdul Qadeer commented on HADOOP-4010:

Just to make sure I understand correctly, you mean
that if there are two splits such that

a b c \r  is one split while
\n d e f \r \n g h i \r \n is the second split.

start = 0; end = 3  for the first split
start = 3; end = 14 for the second split

For Split 1:

(1) The constructor will not throw away the first line because
the start != 0 check will fail.
(2) In the next() method, the first line read will return
abc, and the current pos = 5 (i.e. it points to d).
So in the next iteration of next(), the check
while (pos <= end) will fail because pos = 5; end = 3

For Split 2:
(1) The constructor will try to throw away the first line.  After that
pos = 5 (i.e. it points to d)
(2) next() will read def and ghi

So it looks okay to me?  Have I missed something?
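
To make the walk-through above easy to check, here is a minimal Java sketch of the skip-first-line logic as described in this comment. It is not the code from the attached patches: the class name SplitLineSketch and the helpers readSplit, contentEnd and skipTerminator are hypothetical, the data is a plain in-memory byte array rather than a stream, and the end-of-data guard in the loop is my own assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the LineRecordReader from the patch.
public class SplitLineSketch {

    // Index of the first line terminator at or after pos (or data.length).
    static int contentEnd(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\r' && data[pos] != '\n') {
            pos++;
        }
        return pos;
    }

    // Step over one terminator: "\r\n", a lone "\r", or a lone "\n".
    static int skipTerminator(byte[] data, int pos) {
        if (pos < data.length && data[pos] == '\r') {
            pos++;
            if (pos < data.length && data[pos] == '\n') {
                pos++;
            }
        } else if (pos < data.length && data[pos] == '\n') {
            pos++;
        }
        return pos;
    }

    // Returns the lines that the split [start, end] is responsible for.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        // "Constructor" step: every split except the first discards its first line.
        if (start != 0) {
            pos = skipTerminator(data, contentEnd(data, pos));
        }
        List<String> lines = new ArrayList<>();
        // "next()" step: keep reading while the line starts at or before end.
        // The pos < data.length guard is an assumption added for end of data.
        while (pos <= end && pos < data.length) {
            int stop = contentEnd(data, pos);
            lines.add(new String(data, pos, stop - pos, StandardCharsets.US_ASCII));
            pos = skipTerminator(data, stop);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "abc\r\ndef\r\nghi\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readSplit(data, 0, 3));   // split 1 from the example
        System.out.println(readSplit(data, 3, 14));  // split 2 from the example
    }
}
```

With the offsets used above, split 1 yields only abc and split 2 yields def and ghi, so no line is lost or read twice even though the boundary falls between \r and \n.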

> Changing LineRecordReader algo so that it does not need to skip backwards in the stream
> --------------------------------------------------------------------------------------
>                 Key: HADOOP-4010
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4010
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.19.0
>         Attachments: Hadoop-4010.patch, Hadoop-4010_version2.patch, Hadoop-4010_version3.patch
> The current algorithm of the LineRecordReader needs to move backwards in the stream (in
> its constructor) to correctly position itself in the stream.  It moves back one byte from
> the start of its split, tries to read a record (i.e. a line), and throws that away, because
> it is sure that this line will be taken care of by some other mapper.  This
> algorithm is difficult and inefficient when used for a compressed stream where data is coming
> to the LineRecordReader via some codec.  (In the current implementation, Hadoop does
> not split a compressed file; it makes only one split from the start to the end of the file,
> so only one mapper handles it.  We are currently working on making BZip2 codecs, where splitting
> is possible, work with Hadoop.  This proposed change will make it possible to handle
> plain as well as compressed streams uniformly.)
> In the new algorithm, each mapper always skips its first line because it is sure that
> that line will have been read by some other mapper.  So now each mapper must finish its reading
> at a record boundary which is always beyond its upper split limit.  Due to this change, LineRecordReader
> does not need to move backwards in the stream.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
