hadoop-mapreduce-issues mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5777) Support utf-8 text with BOM (byte order marker)
Date Thu, 29 May 2014 09:06:01 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012207#comment-14012207 ]

zhihai xu commented on MAPREDUCE-5777:
--------------------------------------

BC, thanks, that is a great comment.

Yes, the following suggested change looks better:
      int newMaxLineLength = (int) Math.min(3L + maxLineLength, Integer.MAX_VALUE);
It is also good to know that maxLineLength < 3 never happens.
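
To illustrate why the 3L matters, here is a minimal sketch of the clamped computation. The class and method names (BomLineLimit, widenedLimit) are hypothetical and only for illustration; they are not taken from the patch.

```java
final class BomLineLimit {
    /** Length in bytes of the UTF-8 BOM (0xEF 0xBB 0xBF). */
    static final int UTF8_BOM_LENGTH = 3;

    /**
     * Widens the configured limit by the BOM length without overflowing.
     * Adding in long arithmetic avoids wrap-around when maxLineLength is
     * close to Integer.MAX_VALUE; Math.min(long, long) returns a long,
     * so the result has to be cast back to int.
     */
    static int widenedLimit(int maxLineLength) {
        return (int) Math.min((long) UTF8_BOM_LENGTH + maxLineLength, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(widenedLimit(100));               // prints 103
        System.out.println(widenedLimit(Integer.MAX_VALUE)); // prints 2147483647
    }
}
```

With plain int arithmetic, 3 + Integer.MAX_VALUE would wrap to a negative value, which is the case the clamp guards against.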

I want to discuss the other two points:

First, whether the BOM should be counted in the number of bytes in the first line.
It looks like these 3 UTF-8 BOM bytes are added to the original document; they do not belong
to the original document.
A BOM has no meaning in UTF-8. Many pieces of software on Microsoft Windows, such as Notepad,
add a BOM to the start of the file when saving text as UTF-8. Google Docs adds a BOM when a
Microsoft Word document is downloaded as a plain text file.
The Google Data API has a UnicodeReader which skips the BOM.
I slightly prefer not to count it in the number of bytes in the first line, because
we are trying to strip the BOM (treating the file the same as one with no BOM).
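
As a rough illustration of "treat it the same as no BOM", a standalone sketch of detecting and dropping the three BOM bytes could look like the following. The class and method names are hypothetical, and this is not the code in the patch, which works on the Text buffer inside the record reader.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8BomStripper {
    /** The UTF-8 encoding of U+FEFF. */
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if data begins with the three UTF-8 BOM bytes. */
    static boolean startsWithBom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    /** Returns data without a leading BOM, or data itself if there is none. */
    static byte[] stripBom(byte[] data) {
        return startsWithBom(data)
                ? Arrays.copyOfRange(data, UTF8_BOM.length, data.length)
                : data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] stripped = stripBom(withBom);
        // The stripped line has 2 bytes, the same as a file saved without a BOM.
        System.out.println(new String(stripped, StandardCharsets.UTF_8) + " " + stripped.length);
    }
}
```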

Second, if we read 3 extra bytes for the first line, this could theoretically alter existing
behavior.
Originally I also thought this would be a problem, but then I found the following logic in the
code:
if the size returned from readLine is no less than maxLineLength, we discard the current
line and read the next line.
Also, readLine moves the file pointer to the next line, copies up to newMaxLineLength
bytes into the Text buffer, and returns the real line length
(refer to the readLine implementation).

     newSize = in.readLine(value, newMaxLineLength,
            Math.max(maxBytesToConsume(pos), newMaxLineLength));
     newSize -= 3; // if a BOM is found
     if (newSize < maxLineLength) {
       return true;
     }
Based on this logic, passing a newMaxLineLength larger than the original maxLineLength
to readLine won't alter existing behavior, because a line is only accepted when newSize is
smaller than maxLineLength, and the number of bytes copied to the Text buffer
is never more than newSize.
I should add a comment in the code to clarify this.
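
The argument can be checked with a toy model. The sketch below is not Hadoop's LineReader: its readLine is a simplified stand-in that takes one already-split line, copies at most the given limit, and reports the full line length, and all names are hypothetical. Under those assumptions it compares the original check on a BOM-less line with the widened check on the same line prefixed by a BOM.

```java
import java.util.Arrays;

final class BomReadSketch {
    /** Result of the toy readLine: bytes copied and the full line length. */
    static final class Line {
        final byte[] copied;
        final int size;
        Line(byte[] copied, int size) { this.copied = copied; this.size = size; }
    }

    /** Toy stand-in for readLine: copies at most limit bytes, returns the real length. */
    static Line readLine(byte[] line, int limit) {
        return new Line(Arrays.copyOf(line, Math.min(line.length, limit)), line.length);
    }

    static boolean hasBom(byte[] b) {
        return b.length >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF;
    }

    /** Original behavior: accept the line only if its size is below maxLineLength. */
    static byte[] readOriginal(byte[] line, int maxLineLength) {
        Line l = readLine(line, maxLineLength);
        return l.size < maxLineLength ? l.copied : null; // null means "discard"
    }

    /** Widened read for the first line: allow 3 extra bytes, then strip the BOM. */
    static byte[] readWithBom(byte[] line, int maxLineLength) {
        int newMaxLineLength = (int) Math.min(3L + maxLineLength, Integer.MAX_VALUE);
        Line l = readLine(line, newMaxLineLength);
        byte[] text = l.copied;
        int newSize = l.size;
        if (hasBom(text)) {
            text = Arrays.copyOfRange(text, 3, text.length);
            newSize -= 3; // the BOM is not counted as part of the line
        }
        return newSize < maxLineLength ? text : null;
    }

    public static void main(String[] args) {
        byte[] content = {'a', 'b', 'c', 'd'};
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b', 'c', 'd'};
        for (int max = 3; max <= 6; max++) {
            boolean same = Arrays.equals(readOriginal(content, max), readWithBom(withBom, max));
            System.out.println("maxLineLength=" + max + " same result: " + same);
        }
    }
}
```

In this model an accepted line always has newSize below maxLineLength, so the extra 3 bytes of headroom never let more than maxLineLength bytes of real data through.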

> Support utf-8 text with BOM (byte order marker)
> -----------------------------------------------
>
>                 Key: MAPREDUCE-5777
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5777
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 0.22.0, 2.2.0
>            Reporter: bc Wong
>            Assignee: zhihai xu
>         Attachments: MAPREDUCE-5777.patch
>
>
> UTF-8 text may have a BOM. TextInputFormat, KeyValueTextInputFormat and friends should
> recognize the BOM and not treat it as actual data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)