Date: Thu, 11 Aug 2011 18:53:23 -0700
Subject: Re: why Utf8 (vs String)?
From: Scott Carey
To: "user@avro.apache.org"

Also, Utf8 caches the result of toString(), so that if you call toString() many times, it only allocates the String once. It also implements the CharSequence interface, and many libraries in the JRE accept CharSequence.

Note that Utf8 is mutable and exposes its backing store (byte array). String is immutable. Be careful with how you use Utf8 objects if you hold on to them for a long time or pass them to other code -- users should not expect similar characteristics to String for general use.

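To make the mutability caveat concrete, here is a minimal sketch using only the JDK. MutableText is a hypothetical stand-in for a mutable, byte-backed string with a cached toString(); it is not Avro's Utf8, and the method names collectUnsafe and collectSafe are made up for illustration. It shows what can go wrong when a reused instance is held on to, and how copying to a String avoids it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MutableTextDemo {

    /** Hypothetical stand-in for a mutable, byte-backed string; NOT Avro's Utf8. */
    static final class MutableText implements CharSequence {
        private byte[] bytes = new byte[0];
        private String cached; // decoded lazily, cleared on every mutation

        MutableText set(String s) {
            bytes = s.getBytes(StandardCharsets.UTF_8);
            cached = null; // the old decoded String no longer matches the bytes
            return this;
        }

        byte[] getBytes() { return bytes; } // exposes the backing store

        @Override public String toString() {
            if (cached == null) {
                cached = new String(bytes, StandardCharsets.UTF_8);
            }
            return cached; // repeated calls return the same String
        }

        @Override public int length() { return toString().length(); }
        @Override public char charAt(int i) { return toString().charAt(i); }
        @Override public CharSequence subSequence(int a, int b) {
            return toString().subSequence(a, b);
        }
    }

    /** Holds references to the reused object, so every entry shows the last value. */
    static List<String> collectUnsafe(String... values) {
        MutableText reused = new MutableText();
        List<CharSequence> held = new ArrayList<>();
        for (String v : values) {
            held.add(reused.set(v)); // same instance added each time
        }
        List<String> out = new ArrayList<>();
        for (CharSequence c : held) {
            out.add(c.toString());
        }
        return out;
    }

    /** Copies to an immutable String before holding on to the value. */
    static List<String> collectSafe(String... values) {
        MutableText reused = new MutableText();
        List<String> out = new ArrayList<>();
        for (String v : values) {
            out.add(reused.set(v).toString()); // snapshot taken before the next set()
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectUnsafe("a", "b", "c")); // [c, c, c]
        System.out.println(collectSafe("a", "b", "c"));   // [a, b, c]
    }
}
```

With the real class, the analogous defensive step would be to call toString() on the Utf8 before storing it or handing it to code that expects String-like behavior.
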
On 8/11/11 5:08 PM, "Yang" wrote:

>Thanks a lot Doug
>
>On Thu, Aug 11, 2011 at 5:02 PM, Doug Cutting wrote:
>> This is for performance.
>>
>> A Utf8 may be efficiently compared to other Utf8's, e.g., when sorting,
>> without decoding the UTF-8 bytes into characters. A Utf8 may also be
>> reused, so when iterating through a large number of values (e.g., in a
>> MapReduce job) only a single instance need be allocated, while String
>> would require an allocation per iteration.
>>
>> Note that String may be used when writing data, but that data is
>> generally read as Utf8. The toString() method may be called whenever a
>> String is required. If only equality or ordering is needed, and not
>> substring operations, then leaving values as Utf8 is generally faster
>> than converting to String.
>>
>> Doug
>>
>> On 08/11/2011 04:36 PM, Yang wrote:
>>> if I declare a field to be "string", the generated java implementation
>>> uses avro......Utf8 for that,
>>>
>>> I was wondering what is the thinking behind this, and what is the
>>> proper way to use the Utf8 value -----
>>> oftentimes in my logic, I need to compare the value against other
>>> String's, or store them into other databases , which
>>> of course do not know about Utf8, so that I'd have to transform them
>>> into String's. so it seems being Utf8 unnecessarily
>>> asks for a lot of transformations.
>>>
>>> or I guess I'm not getting the correct usage ?
>>>
>>> Thanks
>>> Yang
>>
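Doug's point in the quoted message about comparing values without decoding can be sketched with plain JDK code. This assumes an unsigned, byte-by-byte comparison of the UTF-8 encodings; the class ByteOrderDemo and its methods are hypothetical names for illustration and are not part of Avro.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteOrderDemo {

    /** Unsigned, byte-by-byte comparison; no characters are decoded. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff; // treat bytes as 0..255, not -128..127
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length; // a shorter prefix sorts first
    }

    /** Sorts the encoded values and decodes only to build the result. */
    static String[] sortEncoded(String... values) {
        byte[][] encoded = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
        }
        Arrays.sort(encoded, ByteOrderDemo::compareBytes); // no decoding while sorting
        String[] out = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = new String(encoded[i], StandardCharsets.UTF_8);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortEncoded("pear", "apple", "fig")));
        // [apple, fig, pear]
    }
}
```

The sort itself never builds a String; decoding happens only at the end, where a String is actually needed.
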