hadoop-common-dev mailing list archives

From "Bwolen Yang" <wbwo...@gmail.com>
Subject \r\n problem in LineRecordReader.java
Date Tue, 12 Jun 2007 17:35:57 GMT
I wrote a map/reduce job to process an ASCII input file and noticed that
the "Map input bytes" statistic does not match the actual file size (only
8.5MB read from a 52MB file).  I simplified the program to an identity
map/reduce:
    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(CanonicalizeFriends.class);
      conf.setJobName("Identity Text Tester");
      JobClient.runJob(conf);
    }

and still see the same problem.

It turned out that this input file has "\r\n" as line breaks. If those
are replaced with "\n", then the "Map input bytes" matches properly.
The problem seems to be related to the read-till-newline code in the
constructor, where readLine() is called on the FSDataInputStream
("fileIn") instead of on a BufferedInputStream.

  public LineRecordReader(Configuration job, FileSplit split)
      ....
    } else if (start != 0) {
      fileIn.seek(start - 1);
      LineRecordReader.readLine(fileIn, null);
      start = fileIn.getPos();
    }

    this.in = new BufferedInputStream(in);
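To make the split-alignment step above concrete, here is a minimal stand-alone sketch using only java.io. It is written from the description in this message, not copied from the Hadoop source. SplitAlignSketch and alignToLineStart are hypothetical names, and a ByteArrayInputStream (which does support mark()/reset()) stands in for the file stream. It computes the new start from the number of bytes readLine() consumed rather than from a stream position.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class SplitAlignSketch {

    /**
     * Reads up to and including the next line terminator (\n, \r, or \r\n).
     * Line content is copied to out when out is non-null.
     * Returns the number of bytes consumed, terminator included.
     */
    static long readLine(InputStream in, OutputStream out) throws IOException {
        long bytes = 0;
        while (true) {
            int b = in.read();
            if (b == -1) {
                break;                 // end of stream
            }
            bytes++;
            if (b == '\n') {
                break;
            }
            if (b == '\r') {
                in.mark(1);            // remember position so one byte can be un-read
                int next = in.read();
                if (next == '\n') {
                    bytes++;           // \r\n counts as a single terminator
                } else if (next != -1) {
                    in.reset();        // lone \r: give the byte back
                }
                break;
            }
            if (out != null) {
                out.write(b);
            }
        }
        return bytes;
    }

    /** Returns the offset of the first line that begins at or after start. */
    static long alignToLineStart(byte[] data, long start) {
        if (start == 0) {
            return 0;
        }
        // Hypothetical stand-in for fileIn.seek(start - 1) on a real file stream.
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        in.skip(start - 1);
        try {
            return (start - 1) + readLine(in, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "ab\r\ncd\r\nef\r\n".getBytes(StandardCharsets.US_ASCII);
        for (long start = 0; start <= data.length; start++) {
            System.out.println(start + " -> " + alignToLineStart(data, start));
        }
    }
}
```

With a stream that honours mark()/reset(), a split starting in the middle of a "\r\n" pair still lands on the first byte of the next line.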

If I change the constructor so that all readLine() calls go through the
BufferedInputStream, as in
    boolean skipFirstLine = false;
    ...
    } else if (start != 0) {
      --start;
      fileIn.seek(start);
      skipFirstLine = true;
    }

    this.in = new BufferedInputStream(in);
    if (skipFirstLine) {
      start += LineRecordReader.readLine(this.in, null);
    }

the problem goes away.    It seems that readLine() calling
    in.mark(1);
with fileIn (FSDataInputStream) somehow corrupts the underlying
buffer.  I checked this against the original code in two ways:
   1. Commenting out "in.mark(1)".  Since this input file always
has \r\n, mark()/reset() is not needed for this input.  In this case
the correct number of input bytes is reported.

OR

   2. Leaving the original code as is, and throwing an exception if "\r" is
not followed by "\n".  In this case, the exception does get thrown.  I
instrumented the code to output its file position.  While it
accurately pinpoints where the "\r" is, the original file does have a
"\n" right after it.
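One thing that can be checked independently of Hadoop is the general java.io contract that the mark()/reset() pair depends on. The base InputStream makes mark() a no-op and reset() throw an IOException, so the pair only round-trips on streams whose markSupported() returns true. The sketch below is a small probe for that contract. NoMarkStream and canUnread are hypothetical names, and it does not establish how FSDataInputStream itself behaves.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

class MarkResetSketch {

    /** A stream that inherits InputStream's default mark()/reset() behaviour. */
    static final class NoMarkStream extends InputStream {
        private final byte[] data;
        private int pos;

        NoMarkStream(byte[] data) {
            this.data = data;
        }

        @Override
        public int read() {
            return pos < data.length ? (data[pos++] & 0xff) : -1;
        }
    }

    /** Returns true if a byte read after mark(1) can be un-read with reset(). */
    static boolean canUnread(InputStream in) {
        try {
            in.mark(1);
            int first = in.read();
            in.reset();                 // throws IOException on the base InputStream
            return in.read() == first;  // the same byte must come back
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = {'\r', '\n', 'x'};
        System.out.println("raw:      " + canUnread(new NoMarkStream(data)));
        System.out.println("buffered: " + canUnread(new BufferedInputStream(new NoMarkStream(data))));
    }
}
```

Running the same probe against the stream actually passed to readLine() would show whether the one-byte look-ahead after "\r" can be trusted on it.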

----------

I dug around the code for a few hours but wasn't fluent enough in it to
pinpoint why this is the case.  Any suggestions?

thanks

bwolen
