hadoop-common-dev mailing list archives

From "James P. White (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-473) TextInputFormat does not correctly handle all line endings
Date Tue, 22 Aug 2006 19:48:14 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-473?page=comments#action_12429789 ] 
James P. White commented on HADOOP-473:

The proposed patch does read-ahead/seek-back even when the initial newline character is '\n'.

That is needlessly wasteful.  The code should be simplified to:

      if (c == '\n')
        break;
      if (c == '\r') {
        long curPos = in.getPos();
        char nextC = (char)in.read();
        // EOF -1 is also != '\n' so will work with Mac and EOF
        if (nextC != '\n') {
          // seek back 1 char because the line ending is only 1 char
          in.seek(curPos);
        }
        break;
      }
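
For readers who want to try the idea outside Hadoop, here is a minimal, self-contained sketch. It is not the patch itself: it assumes a plain java.io.PushbackInputStream in place of Hadoop's seekable stream, so unread() stands in for the getPos()/seek() pair, and the class and method names (LineEndingSketch, readLine, readAll) are hypothetical. The point it illustrates is the same: read ahead only after a '\r', never after a '\n'.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineEndingSketch {
    /** Reads one line, consuming "\n", "\r", or "\r\n"; returns null at EOF. */
    static String readLine(PushbackInputStream in) throws IOException {
        int c = in.read();
        if (c == -1) {
            return null; // nothing left to read
        }
        StringBuilder line = new StringBuilder();
        while (c != -1) {
            if (c == '\n') {
                break; // Unix ending: no read-ahead needed
            }
            if (c == '\r') {
                int next = in.read(); // read ahead only after '\r'
                if (next != '\n' && next != -1) {
                    in.unread(next); // lone '\r' (old Mac): put the byte back
                }
                break;
            }
            line.append((char) c);
            c = in.read();
        }
        return line.toString();
    }

    /** Hypothetical helper: splits an ASCII string into lines using readLine. */
    static List<String> readAll(String text) throws IOException {
        PushbackInputStream in = new PushbackInputStream(
            new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII)));
        List<String> lines = new ArrayList<>();
        for (String l = readLine(in); l != null; l = readLine(in)) {
            lines.add(l);
        }
        return lines;
    }
}
```

With this sketch, an input that mixes Windows, Mac, and Unix endings yields one entry per line and no spurious blank strings.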


> TextInputFormat does not correctly handle all line endings
> ----------------------------------------------------------
>                 Key: HADOOP-473
>                 URL: http://issues.apache.org/jira/browse/HADOOP-473
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.5.0, 0.6.0
>         Environment: All environments
>            Reporter: Dennis Kubes
>         Attachments: text-input-format.patch
> The current TextInputFormat readLine method breaks on either a single '\r' or a single '\n'
> character.  For Windows-formatted text files, which end lines with '\r' '\n', this leaves a
> trailing '\n' character in the stream, so the next call to readLine on the same input stream
> returns a blank string.  The attached patch corrects this by looking for either single- or
> double-character line endings and positioning the input stream at the start of the next line.
> It correctly handles Windows, Mac, and Unix line endings.
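
The behavior described above can be reproduced with a small sketch. This is not the Hadoop source: NaiveReadLineDemo and its methods are hypothetical names, and the reader works on a plain java.io.InputStream. It only mimics the reported logic of stopping at the first '\r' or '\n'.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class NaiveReadLineDemo {
    /** Hypothetical naive reader: stops at the first '\r' or '\n' it sees. */
    static String naiveReadLine(InputStream in) throws IOException {
        int c = in.read();
        if (c == -1) {
            return null; // end of input
        }
        StringBuilder line = new StringBuilder();
        while (c != -1 && c != '\r' && c != '\n') {
            line.append((char) c);
            c = in.read();
        }
        // For "\r\n" only the '\r' has been consumed here; the '\n' stays in
        // the stream and is seen as an empty line on the next call.
        return line.toString();
    }

    /** Hypothetical helper: collects every line the naive reader returns. */
    static List<String> naiveReadAll(String text) throws IOException {
        InputStream in = new ByteArrayInputStream(
            text.getBytes(StandardCharsets.US_ASCII));
        List<String> lines = new ArrayList<>();
        for (String l = naiveReadLine(in); l != null; l = naiveReadLine(in)) {
            lines.add(l);
        }
        return lines;
    }
}
```

Feeding this reader two Windows-terminated lines returns four strings, with an empty string after each real line, which matches the symptom in the report.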


