hadoop-common-dev mailing list archives

From Jim White <...@pagesmiths.com>
Subject TextInputFormat bug - lines which are *not* split
Date Mon, 21 Aug 2006 17:23:31 GMT
As I don't have Hadoop installed (yet), I'm not able to offer a test
case, but I'm fairly confident there is a bug in TextInputFormat.

The current implementation will ignore the first line of a file split
when the previous split ended with a newline.

There are two ways to fix this. The easiest (and most efficient) is for
the preceding split to always read up to the first newline in the
succeeding split.

Changing:

http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup

public class TextInputFormat extends InputFormatBase {
...
    return new RecordReader() {
...
        /** Read a line. */
        public synchronized boolean next(Writable key, Writable value)
          throws IOException {
          long pos = in.getPos();
          if (pos >= end)
            return false;

to:

          if (pos > end)
            return false;

will do the trick.
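
To make the boundary case concrete, here is a minimal standalone sketch
that simulates the two checks on an in-memory byte array. It does not use
Hadoop at all. The class name SplitBoundaryDemo, the methods readSplit and
readAllSplits, and the proposed flag are hypothetical, and the skipping
rule (a split that does not start at offset 0 discards everything up to
and including the first newline at or after its start) is an assumption
about the reader's behavior, not something taken from the source file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitBoundaryDemo {

    /**
     * Returns the lines a reader would emit for the split [start, end).
     * ASSUMPTION: a split with start != 0 discards bytes up to and
     * including the first newline found at or after start.
     * proposed == false uses the check "pos >= end";
     * proposed == true  uses the check "pos > end".
     */
    public static List<String> readSplit(byte[] data, int start, int end, boolean proposed) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Skip the (possibly partial) first line of this split.
            while (pos < data.length && data[pos] != '\n') pos++;
            if (pos < data.length) pos++; // step past the newline itself
        }
        while (pos < data.length) {
            boolean pastEnd = proposed ? pos > end : pos >= end;
            if (pastEnd) break;
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.US_ASCII));
            if (pos < data.length) pos++; // step past the newline itself
        }
        return lines;
    }

    /** Cuts data into fixed-size splits and concatenates what each split emits. */
    public static List<String> readAllSplits(byte[] data, int splitSize, boolean proposed) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            all.addAll(readSplit(data, start, end, proposed));
        }
        return all;
    }

    public static void main(String[] args) {
        // Three 3-byte lines, so every 3-byte split ends exactly on a newline.
        byte[] data = "aa\nbb\ncc\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readAllSplits(data, 3, false)); // prints [aa]
        System.out.println(readAllSplits(data, 3, true));  // prints [aa, bb, cc]
    }
}
```

Under these assumptions, the ">=" check drops every line that begins
exactly at a split boundary, while the ">" check lets the preceding split
read that line and the succeeding split skip it, so each line is emitted
exactly once.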

Jim

