avro-user mailing list archives

From Mark Hayes <m...@greybird.com>
Subject Re: Why is the String type a Schema property?
Date Thu, 24 May 2012 17:28:50 GMT
On Thu, May 24, 2012 at 9:08 AM, Doug Cutting <cutting@apache.org> wrote:

> On 05/23/2012 09:10 PM, Mark Hayes wrote:
>> So my question is:  Why is the string type a property in the schema,
>> i.e., why does option (2) exist in Avro?  Is there something I'm missing
>> about its benefit?
> It's for back compatibility.  Strings in specific and generic
> representations were originally always read as Utf8, so many existing
> applications expect strings to be Utf8.  Rather than breaking all of these
> applications we instead permitted folks to opt in to this change.  For
> applications that use the specific representation (those that generate
> code) and wish to change from Utf8 to String it requires only adding a
> single parameter to their Maven configuration, so it's not very invasive.
>  The runtime must know which representation is desired for strings, and the
> Schema is the convenient runtime structure to annotate.

Thank you for the reply, Doug!  Your reply has made me think harder about
why this is an issue for us.

I think the reason is that we're storing the schema in the database, with
an internal reference to the schema in each record.  The stored schema is
shared by all clients reading and writing records using that schema, even
though clients operating on the same records may have very distinct
purposes: an OLTP application, a Map/Reduce job for data analysis, or a
general purpose utility for data viewing.

The stored/shared schema must either have these string type properties, or
not.  If it does have them, this impacts the string type for all clients
reading from the database.  So they would have to all agree on the string
type, or dynamically determine it.

So it is due to this sharing of the schema that I'm tending toward
subclassing the DatumReader.  That way, the string type is divorced from
the shared schema, and each client can decide independently on the string
type it wishes to use.
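
The idea can be sketched in a few lines. This is a simplified,
self-contained model and not Avro's actual API: SharedSchema, BaseReader,
and JavaStringReader are hypothetical names, and StringBuilder merely
stands in for a non-String CharSequence such as Utf8. In Avro itself the
corresponding hook would presumably be the string-reading method of
GenericDatumReader, which should be checked against the version in use.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the stored, shared schema.
// Note that it carries no string-type property at all.
final class SharedSchema {
    final String name;

    SharedSchema(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for a datum reader. The string representation
// is decided by the reader class, not by the schema.
class BaseReader {
    protected final SharedSchema schema;

    BaseReader(SharedSchema schema) {
        this.schema = schema;
    }

    // Default: return a non-String CharSequence, modeling clients that
    // still expect the original Utf8-style representation.
    public CharSequence readString(byte[] utf8Bytes) {
        return new StringBuilder(new String(utf8Bytes, StandardCharsets.UTF_8));
    }
}

// A client that prefers java.lang.String overrides only the string hook.
class JavaStringReader extends BaseReader {
    JavaStringReader(SharedSchema schema) {
        super(schema);
    }

    @Override
    public CharSequence readString(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }
}
```

Two clients can then construct different readers over the same
SharedSchema instance, and neither choice is visible to the other or
recorded in the stored schema.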

Does this make sense to you as well?

