hadoop-common-dev mailing list archives

From "Hairong Kuang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-302) class Text (replacement for class UTF8) was: HADOOP-136
Date Sat, 08 Jul 2006 00:27:31 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12419800 ] 

Hairong Kuang commented on HADOOP-302:

There are two issues with the current implementation of UTF8.

The first is that it does not handle overly long strings. The length of a string is limited to
a short, not an int. I'd like to address this problem by storing the length of a string in
a variable-length format. The highest bit of each byte is an extension bit: '1' means that
more bytes follow, while '0' marks the last byte.
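
To make the proposed format concrete, here is a minimal Java sketch of such a variable-length length prefix. The class and method names (VarLenSketch, writeLength, readLength, encode, decode) are hypothetical and not part of Hadoop. The comment above does not specify a byte order, so the sketch assumes 7 payload bits per byte with the low-order group written first, as Lucene's VInt does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not an existing Hadoop class.
final class VarLenSketch {
    private VarLenSketch() {}

    /** Writes a non-negative length using 7 payload bits per byte, low-order group first (assumed order). */
    static void writeLength(DataOutput out, int length) throws IOException {
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        while ((length & ~0x7F) != 0) {
            out.writeByte((length & 0x7F) | 0x80); // extension bit set: more bytes follow
            length >>>= 7;
        }
        out.writeByte(length); // extension bit clear: last byte
    }

    /** Reads a length written by writeLength. */
    static int readLength(DataInput in) throws IOException {
        int result = 0;
        for (int shift = 0; shift <= 28; shift += 7) {
            byte b = in.readByte();
            if (shift == 28 && (b & 0x78) != 0) {
                throw new IOException("length does not fit in a non-negative int");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("length prefix longer than 5 bytes");
    }

    /** Convenience wrapper: encodes a length to a fresh byte array. */
    static byte[] encode(int length) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            writeLength(new DataOutputStream(buf), length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Convenience wrapper: decodes a length from the start of a byte array. */
    static int decode(byte[] bytes) {
        try {
            return readLength(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Under these assumptions, lengths up to 127 take one byte, a length such as 70000 (which no longer fits in a short) takes three bytes, and the largest int takes five.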

The second is that the class chooses Java modified UTF8 as the serialized form.  Some argue
that we should use the standard UTF8. It seems to me that serializing a string to Java modified
UTF8 is quite efficient. But it is a Java-specific encoding. If we want to support inter-programming-language
communication, it makes more sense to use the standard UTF8.
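
For readers unfamiliar with the difference, the following Java sketch compares the two encodings using only JDK calls. The class and method names (Utf8FormsSketch, standardUtf8, modifiedUtf8) are hypothetical. It relies on DataOutputStream.writeUTF, which emits a 2-byte length followed by modified UTF-8; the sketch strips that 2-byte prefix so that only the character bytes are compared.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class Utf8FormsSketch {
    private Utf8FormsSketch() {}

    /** Standard UTF-8 bytes of a string. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Java modified UTF-8 bytes of a string, without writeUTF's 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length); // drop the length prefix
    }
}
```

The two forms agree for plain ASCII text but differ in two cases: modified UTF-8 writes U+0000 as the two bytes 0xC0 0x80 instead of a single zero byte, and it writes each supplementary character (for example U+1D11E) as two 3-byte surrogate encodings, six bytes in total, where standard UTF-8 uses four. A non-Java reader that expects standard UTF-8 would have to special-case both.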

As for the name of the class, could I use "StringWritable"? It would be consistent with other
classes that implement WritableComparable, such as IntWritable and FloatWritable.

> class Text (replacement for class UTF8) was: HADOOP-136
> -------------------------------------------------------
>          Key: HADOOP-302
>          URL: http://issues.apache.org/jira/browse/HADOOP-302
>      Project: Hadoop
>         Type: Improvement

>   Components: io
>     Reporter: Michel Tourn
>     Assignee: Hairong Kuang

> Just to verify, which length-encoding scheme are we using for class Text (aka LargeUTF8)?
>
> a) The "UTF-8/Lucene" scheme (highest bit of each byte is an extension bit, which I
> think is what Doug is describing in his last comment), or
> b) the record-IO scheme in o.a.h.record.Utils.java:readInt?
>
> Either way, note that:
> 1. UTF8.java and its successor Text.java need to read the length in two ways:
>   1a. consume 1+ bytes from a DataInput and
>   1b. parse the length within a byte array at a given offset
> (1b is used for the "WritableComparator optimized for UTF8 keys").
> o.a.h.record.Utils only supports the DataInput mode.
> It is not clear to me what the best way is to extend this Utils code when you need to
> support both reading modes.
> 2. Methods like UTF8's WritableComparator are meant to be low overhead; in particular, there
> should be no Object allocation.
> For the byte array case, the varlen-reader utility needs to be extended to return both
> the decoded length and the length of the encoded length
> (so that the caller can do offset += encodedlength).
> 3. A String length does not need (small) negative integers.
> 4. One advantage of a) is that it is standard (or at least well-known and natural) and
> there are no magic constants (like -120, -121, -124).
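
Points 1b and 2 of the quoted description can be illustrated with a short Java sketch that reads a length prefix directly from a byte array without allocating any objects. The names (VarLenArraySketch, lengthAt, prefixSizeAt) are hypothetical and not part of Hadoop, and the byte order (low-order 7-bit group first, as in scheme a) is an assumption. Rather than returning a pair object, the sketch exposes the decoded length and the size of the encoded prefix as two separate static methods; this is one possible design among several.

```java
// Hypothetical helper for illustration only; not an existing Hadoop class.
final class VarLenArraySketch {
    private VarLenArraySketch() {}

    /** Decodes the length whose prefix starts at buf[offset]; allocates nothing. */
    static int lengthAt(byte[] buf, int offset) {
        int result = 0;
        int shift = 0;
        int i = offset;
        while (true) {
            byte b = buf[i++];
            if (shift == 28 && (b & 0x78) != 0) {
                throw new IllegalArgumentException("length does not fit in a non-negative int");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result; // extension bit clear: last byte
            }
            shift += 7;
            if (shift > 28) {
                throw new IllegalArgumentException("length prefix longer than 5 bytes");
            }
        }
    }

    /** Returns how many bytes the length prefix starting at buf[offset] occupies. */
    static int prefixSizeAt(byte[] buf, int offset) {
        int i = offset;
        while ((buf[i] & 0x80) != 0) {
            i++; // skip bytes whose extension bit is set
        }
        return i - offset + 1;
    }
}
```

A raw-bytes comparator could then call lengthAt(buf, offset) to learn how many string bytes follow and advance with offset += prefixSizeAt(buf, offset) to reach them, keeping the comparison free of object allocation.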

This message is automatically generated by JIRA.