avro-user mailing list archives

From Scott Carey <scottca...@apache.org>
Subject Re: why Utf8 (vs String)?
Date Fri, 12 Aug 2011 01:53:23 GMT
Also, Utf8 caches the result of toString(), so that if you call toString()
many times, it only allocates the String once.
It also implements the CharSequence interface, and many libraries in the
JRE accept CharSequence.
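
A rough sketch of the caching behavior described above, using a hypothetical stand-in class (ByteText is invented here for illustration and is not Avro's Utf8; the decodeCount field exists only to make the caching visible). It decodes the bytes on the first toString() call, returns the same String afterwards, and can be handed to JRE methods that take a CharSequence:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class CachedToStringSketch {

    /** Hypothetical stand-in for a Utf8-like value; NOT Avro's actual class. */
    static final class ByteText implements CharSequence {
        private final byte[] bytes;  // UTF-8 encoded contents
        private String cached;       // decoded lazily, at most once
        int decodeCount = 0;         // demonstration only: counts real decodes

        ByteText(String s) {
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public String toString() {
            if (cached == null) {
                // Only the first call pays for decoding and allocation.
                cached = new String(bytes, StandardCharsets.UTF_8);
                decodeCount++;
            }
            return cached;
        }

        // CharSequence methods delegate to the cached String.
        @Override public int length() { return toString().length(); }
        @Override public char charAt(int index) { return toString().charAt(index); }
        @Override public CharSequence subSequence(int start, int end) {
            return toString().subSequence(start, end);
        }
    }

    public static void main(String[] args) {
        ByteText t = new ByteText("user-42");
        String first = t.toString();
        String second = t.toString();
        System.out.println("same instance: " + (first == second));
        System.out.println("decodes: " + t.decodeCount);
        // Pattern.matches accepts any CharSequence, not just String.
        System.out.println("matches: " + Pattern.matches("user-\\d+", t));
    }
}
```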

Note that Utf8 is mutable and exposes its backing store (byte array).
String is immutable.  Be careful with Utf8 objects that you hold on to
for a long time or pass to other code: they do not behave like Strings
in general use.
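
To make the hazard concrete, here is a minimal sketch with a hypothetical mutable holder (MutableText and the two collect methods are invented for illustration; this is not Avro's Utf8). It assumes the common pattern where one instance is reused while iterating over many values. Storing the reference aliases every element to the same object, while copying to a String first keeps each value:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MutableValueHazard {

    /** Hypothetical mutable, reusable text holder; a stand-in, NOT Avro's Utf8. */
    static final class MutableText {
        private byte[] bytes = new byte[0];

        MutableText set(String s) {
            bytes = s.getBytes(StandardCharsets.UTF_8);
            return this;
        }

        /** Exposes the backing array, so callers can change the contents. */
        byte[] getBytes() { return bytes; }

        @Override
        public String toString() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    /** Risky: keeps references to the single reused instance. */
    static List<MutableText> collectReferences(String[] inputs) {
        MutableText reused = new MutableText();
        List<MutableText> out = new ArrayList<>();
        for (String in : inputs) {
            out.add(reused.set(in)); // every element is the same object
        }
        return out;
    }

    /** Safer: copies each value into an immutable String before storing it. */
    static List<String> collectCopies(String[] inputs) {
        MutableText reused = new MutableText();
        List<String> out = new ArrayList<>();
        for (String in : inputs) {
            out.add(reused.set(in).toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] inputs = {"a", "b", "c"};
        System.out.println(collectReferences(inputs)); // all elements show the last value
        System.out.println(collectCopies(inputs));     // each element keeps its own value
    }
}
```

The same reasoning applies to using such an object as a map key or handing it to code that stores it: copy to a String at the boundary if the value must stay stable.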

On 8/11/11 5:08 PM, "Yang" <teddyyyy123@gmail.com> wrote:

>Thanks a lot, Doug.
>On Thu, Aug 11, 2011 at 5:02 PM, Doug Cutting <cutting@apache.org> wrote:
>> This is for performance.
>> A Utf8 may be efficiently compared to other Utf8s, e.g., when sorting,
>> without decoding the UTF-8 bytes into characters.  A Utf8 may also be
>> reused, so when iterating through a large number of values (e.g., in a
>> MapReduce job) only a single instance need be allocated, while String
>> would require an allocation per iteration.
>> Note that String may be used when writing data, but that data is
>> generally read as Utf8.  The toString() method may be called whenever a
>> String is required.  If only equality or ordering is needed, and not
>> substring operations, then leaving values as Utf8 is generally faster
>> than converting to String.
>> Doug
>> On 08/11/2011 04:36 PM, Yang wrote:
>>> If I declare a field to be "string", the generated Java implementation
>>> uses avro......Utf8 for that.
>>> I was wondering what the thinking behind this is, and what the
>>> proper way to use the Utf8 value is.
>>> Often in my logic I need to compare the value against other
>>> Strings, or store it in other databases, which
>>> of course do not know about Utf8, so I have to convert it
>>> to a String.  So using Utf8 seems to require
>>> a lot of unnecessary conversions.
>>> Or am I not using it correctly?
>>> Thanks
>>> Yang
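
Doug's point about comparing values without decoding the UTF-8 bytes can be sketched roughly as follows. This is an illustration under assumptions, not Avro's implementation: compareUtf8Bytes and utf8 are hypothetical helpers, and the values are plain byte arrays rather than Utf8 objects. The comparison walks the encoded bytes directly, treating each byte as unsigned, and never builds a String:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteWiseCompareSketch {

    /** Lexicographic comparison of two byte arrays, treating bytes as unsigned. */
    static int compareUtf8Bytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff; // mask to get the unsigned value
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // Equal up to the shorter length: the shorter array sorts first.
        return a.length - b.length;
    }

    /** Hypothetical helper: encodes a String as UTF-8 bytes. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[][] values = { utf8("pear"), utf8("apple"), utf8("fig") };
        // Sort on the encoded bytes; no characters are decoded during sorting.
        Arrays.sort(values, ByteWiseCompareSketch::compareUtf8Bytes);
        for (byte[] v : values) {
            // Decoding happens only here, for display.
            System.out.println(new String(v, StandardCharsets.UTF_8));
        }
    }
}
```

Unsigned byte order on UTF-8 corresponds to Unicode code point order, which can differ from String.compareTo for characters outside the Basic Multilingual Plane, so the two orderings should not be mixed in one sorted collection.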
