hadoop-common-dev mailing list archives

From "Michel Tourn (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-136) Overlong UTF8's not handled well
Date Fri, 07 Jul 2006 00:12:30 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-136?page=comments#action_12419627 ] 

Michel Tourn commented on HADOOP-136:
-------------------------------------

FYI:
some info on Java's modified UTF-8
(this was posted previously)
See Modified UTF-8 in: 
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ 

As far as I understand:
The bottom line is that supplementary Unicode characters (code points above U+FFFF):
o would be encoded as 4 bytes in standard UTF-8 by non-Java programs
o but they are already represented as two Java chars (a surrogate pair, 16 bits each) when
our converter code sees them.
o and so the conversion to UTF-8 proceeds on these two chars independently, giving 3 bytes
for each.
o So all the existing Java UTF-8 code that only handles chars 1 to 3 bytes wide is already
compliant with Java-modified UTF-8.
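
The byte counts above can be checked with a small sketch. It is not Hadoop code: the class
and method names are made up for illustration, and it uses DataOutputStream.writeUTF from
the JDK as a stand-in for any converter that encodes each Java char independently.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Hadoop.
public class ModifiedUtf8Demo {

    /** Number of bytes standard UTF-8 needs for the whole string. */
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Number of bytes Java's modified UTF-8 needs, as written by
     * DataOutputStream.writeUTF, not counting its 2-byte length prefix.
     */
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2;
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is a supplementary character.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());              // 2 (a surrogate pair)
        System.out.println(standardUtf8Length(clef));   // 4
        System.out.println(modifiedUtf8Length(clef));   // 6 (3 bytes per surrogate)
    }
}
```

For strings without supplementary characters or U+0000, the two lengths agree.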

What do the java-i18n experts think?

---
Earlier comment:

Concerning 4-byte UTF-8 characters:
it seems that this situation does not occur in Java-modified UTF-8.

Characters that need 4 bytes in standard UTF-8 are represented as 3+3 bytes,
one 3-byte sequence per surrogate.
See Modified UTF-8 in: 
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ 


> Overlong UTF8's not handled well
> --------------------------------
>
>          Key: HADOOP-136
>          URL: http://issues.apache.org/jira/browse/HADOOP-136
>      Project: Hadoop
>         Type: Bug
>   Components: io
>     Versions: 0.2.0
>     Reporter: Dick King
>     Assignee: Michel Tourn
>     Priority: Minor
>      Fix For: 0.5.0
>  Attachments: largeutf8.patch
>
> When we feed an overlong string to the UTF8 constructor, three suboptimal things happen.
> First, we truncate to 0xffff/3 characters on the assumption that every character takes
> three bytes in UTF8.  This can truncate strings that don't need it, and it can be
> overoptimistic since there are characters that render as four bytes in UTF8.
> Second, the code doesn't actually handle four-byte characters.
> Third, there's a behavioral discontinuity.  If the string is "discovered" to be overlong
> by the arbitrary limit described above, we truncate with a log message; otherwise we
> signal a RuntimeException.  One feels that both forms of truncation should be treated
> alike.  However, this issue is concealed by the second issue; the exception will never be
> thrown because UTF8.utf8Length can't return more than three times the length of its input.
> I would recommend changing UTF8.utf8Length to let its caller know how many characters
> of the input string will actually fit if there's an overflow [perhaps by returning the
> negative of that number] and doing the truncation accurately as needed.
> -dk
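
The recommendation in the quoted description could look roughly like the sketch below. It
is an illustration under assumptions, not the attached patch: the class name Utf8Fit, the
method name utf8LengthOrFit and the maxBytes parameter are all hypothetical, and each Java
char is assumed to encode independently to 1, 2 or 3 bytes.

```java
// Hypothetical sketch, not the contents of largeutf8.patch.
public class Utf8Fit {

    /** Bytes needed for one Java char when each char is encoded on its own. */
    static int encodedLength(char c) {
        if (c >= 0x0001 && c <= 0x007F) {
            return 1;
        }
        if (c <= 0x07FF) {
            return 2; // includes U+0000, which modified UTF-8 writes as 2 bytes
        }
        return 3;     // everything else, including each half of a surrogate pair
    }

    /**
     * Returns the encoded length in bytes if the whole string fits in maxBytes.
     * Otherwise returns the negative of the number of leading chars that fit.
     */
    static int utf8LengthOrFit(String s, int maxBytes) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            int n = encodedLength(s.charAt(i));
            if (bytes + n > maxBytes) {
                return -i; // only the first i chars fit
            }
            bytes += n;
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(utf8LengthOrFit("abc", 10));          // 3
        System.out.println(utf8LengthOrFit("abcd", 3));          // -3
        System.out.println(utf8LengthOrFit("\u20AC\u20AC", 5));  // -1
    }
}
```

Two caveats with this sketch: a return value of 0 is ambiguous when maxBytes is smaller
than the first char's encoding, and a cut point may land between the two halves of a
surrogate pair, so a caller doing the truncation may want to back off by one char there.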

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

