hadoop-common-dev mailing list archives

From Jim White <...@pagesmiths.com>
Subject TextInputFormat bug - lines which are *not* split
Date Mon, 21 Aug 2006 17:23:31 GMT
As I don't have Hadoop installed (yet), I'm not able to offer a test
case, but I'm fairly confident there is a bug in TextInputFormat.

The current implementation will ignore the first line of a file split
when the previous split ended with a newline.

There are two ways to fix this. The easiest (and most efficient) is for
the preceding split to always read up to the first newline in the
succeeding split.



public class TextInputFormat extends InputFormatBase {
    return new RecordReader() {
        /** Read a line. */
        public synchronized boolean next(Writable key, Writable value)
          throws IOException {
          long pos = in.getPos();
          if (pos >= end)
            return false;


Changing that check to

          if (pos > end)
            return false;

will do the trick.
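
For illustration only, here is a small standalone sketch of the off-by-one, not the Hadoop code itself. It assumes a simplified model in which a reader for any split that does not start at offset 0 first skips forward past the first newline it finds (treating itself as starting mid-line), and then reads whole lines until the end check fails. The class SplitLineDemo, its methods, and the inclusiveEnd flag are hypothetical names made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of split-based line reading; not Hadoop's API.
public class SplitLineDemo {

    /** Lines emitted by one reader for the byte range [start, start + length). */
    static List<String> readSplit(byte[] data, int start, int length, boolean inclusiveEnd) {
        int end = start + length;
        int pos = start;
        // Assumption: a split not at offset 0 discards everything up to
        // and including the first newline it sees.
        if (start != 0) {
            while (pos < data.length) {
                if (data[pos++] == '\n') break;
            }
        }
        List<String> lines = new ArrayList<>();
        while (pos < data.length) {
            // inclusiveEnd == false models "pos >= end" (current check);
            // inclusiveEnd == true models "pos > end" (proposed check).
            boolean stop = inclusiveEnd ? pos > end : pos >= end;
            if (stop) break;
            StringBuilder line = new StringBuilder();
            while (pos < data.length) {
                char c = (char) data[pos++];
                if (c == '\n') break;
                line.append(c);
            }
            lines.add(line.toString());
        }
        return lines;
    }

    /** Cuts the text into fixed-size splits and concatenates what each reader emits. */
    static List<String> readAll(String text, int splitSize, boolean inclusiveEnd) {
        byte[] data = text.getBytes(StandardCharsets.US_ASCII);
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int length = Math.min(splitSize, data.length - start);
            all.addAll(readSplit(data, start, length, inclusiveEnd));
        }
        return all;
    }

    public static void main(String[] args) {
        // Every 4-byte split ends exactly on a newline.
        String text = "aaa\nbbb\nccc\n";
        System.out.println(readAll(text, 4, false)); // prints [aaa]
        System.out.println(readAll(text, 4, true));  // prints [aaa, bbb, ccc]
    }
}
```

With the ">=" check, each later split skips its own first line and the preceding reader has already stopped, so "bbb" and "ccc" are never emitted. With ">", a reader sitting exactly on the split boundary reads one more line, which is the line the next split skips. This toy model says nothing about other edge cases in the real reader.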

