hadoop-common-dev mailing list archives

From "Bwolen Yang" <wbwo...@gmail.com>
Subject Re: \r\n problem in LineRecordReader.java
Date Wed, 13 Jun 2007 05:28:03 GMT
Here is probably the cause of this bug:

    public int read(byte b[], int off, int len) throws IOException {
      // make sure that it ends at a checksum boundary
      long curPos = getPos();
      long endPos = len+curPos/bytesPerSum*bytesPerSum;
      return readBuffer(b, off, (int)(endPos-curPos));
    }

Here, the caller calls the function with len = 127, and bytesPerSum is 256.
Since the division and multiplication apply to curPos alone before len is
added, endPos can end up smaller than curPos, so endPos-curPos becomes a
negative number (e.g., -381).  readBuffer() then gets called with a
negative length, for which the underlying DFSInputStream::read()
naturally returns 0.  When readBuffer() sees 0 bytes read, it assumes
end-of-file and returns -1.  The rest of the system treats -1 as
end-of-file, and hence a large part of the input file is not read.
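
To make the arithmetic concrete, here is a small standalone sketch (not
Hadoop code) that evaluates the same expression for a few positions.  The
class name, the method name, and the curPos values are made up for
illustration; only len = 127 and bytesPerSum = 256 come from the case above.

```java
final class ChecksumBoundaryDemo {
    // Same arithmetic as the quoted read(). In Java, / and * bind tighter
    // than +, so this is len + ((curPos / bytesPerSum) * bytesPerSum).
    static long lengthPassedToReadBuffer(long curPos, int len, int bytesPerSum) {
        long endPos = len + curPos / bytesPerSum * bytesPerSum;
        return endPos - curPos;
    }

    public static void main(String[] args) {
        int bytesPerSum = 256;
        int len = 127;
        // Hypothetical stream positions at different offsets within a chunk.
        long[] positions = {0, 100, 127, 128, 255, 511};
        for (long curPos : positions) {
            System.out.println("curPos=" + curPos + " -> length="
                + lengthPassedToReadBuffer(curPos, len, bytesPerSum));
        }
    }
}
```

In this sketch the computed length works out to len - (curPos % bytesPerSum),
so it goes negative whenever the current position is more than len bytes
past the previous checksum boundary.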

I guess a small read somehow gets triggered by mark()/reset(), which in
turn triggers this bug.  Any suggestions for how to fix this?

btw, appended is roughly what the stack looks like on a distributed
file system read.   Hopefully it will help the next person who wants
to understand how read() happens.   Note that FSDataInputStream gets
wrapped twice.

bwolen


What happens on a DFS read.  Note that some of these are a subclass
calling its superclass's read(); others are a class calling its member
variable's read().

  FSDataInputStream::read() == DataInputStream::read()
  Buffer::read()
  BufferedInputStream::read()
  PositionCache::read()
  FSInputChecker::read()
  DFSDataInputStream::read() == FSDataInputStream::read() == DataInputStream...
  Buffer::read()
  BufferedInputStream::read()
  PositionCache::read()
  DFSInputStream::read()
  DataInputStream::read()    (blockStream)
  BufferedInputStream::read()
  socket
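
For anyone who wants to see this kind of delegation in isolation, here is
a minimal sketch using only java.io.  It is not the Hadoop classes: Layer
is a hypothetical stand-in that borrows two names from the stack above as
labels, a ByteArrayInputStream stands in for the socket, and the buffer
sizes are the java.io defaults.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class WrappedReadDemo {
    // Hypothetical stand-in for one layer: records its label, then calls
    // the superclass read(), which forwards to the wrapped member "in".
    static final class Layer extends FilterInputStream {
        private final String label;
        private final List<String> trace;

        Layer(String label, InputStream in, List<String> trace) {
            super(in);
            this.label = label;
            this.trace = trace;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            trace.add(label);
            return super.read(b, off, len);
        }
    }

    // Issues one small read on the outermost stream and returns the labels
    // of the layers it passed through, outermost first.
    static List<String> traceOneRead() {
        List<String> trace = new ArrayList<>();
        InputStream source = new ByteArrayInputStream(new byte[16]); // stands in for the socket
        InputStream inner = new Layer("DFSInputStream", source, trace);
        InputStream checker = new Layer("FSInputChecker", new BufferedInputStream(inner), trace);
        try (DataInputStream outer = new DataInputStream(new BufferedInputStream(checker))) {
            outer.read(new byte[4], 0, 4);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(traceOneRead());
    }
}
```

Both styles from the note above show up here: Layer calls its superclass's
read(), while FilterInputStream and BufferedInputStream call read() on the
stream they hold as a member.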
