avro-user mailing list archives

From Bruce Mitchener <bruce.mitche...@gmail.com>
Subject Re: question about completely untagged data...
Date Mon, 29 Nov 2010 04:44:05 GMT
To be clear, HAvroBase stores tuples of (schema ID, data) and then looks up
the schema from that ID.  It doesn't store each schema separately / entirely
alongside the corresponding data records / entries.
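
A minimal sketch of that layout, assuming two in-memory dicts stand in for the schema table and the data table. The names here (register_schema, put, get) are hypothetical and are not HAvroBase's actual API; the payload is treated as opaque bytes that only make sense once the schema has been looked up.

```python
import hashlib
import json

# Hypothetical stand-ins for the two tables (not HAvroBase's real storage).
schemas = {}  # schema ID -> schema JSON text, stored once per schema
rows = {}     # row key   -> (schema ID, encoded payload bytes)


def register_schema(schema: dict) -> str:
    """Store a schema once and return a short, content-derived ID."""
    text = json.dumps(schema, sort_keys=True)
    schema_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    schemas[schema_id] = text
    return schema_id


def put(key: str, schema_id: str, payload: bytes) -> None:
    """Store a (schema ID, data) tuple; the schema itself is not copied."""
    if schema_id not in schemas:
        raise KeyError(f"unknown schema ID: {schema_id}")
    rows[key] = (schema_id, payload)


def get(key: str):
    """Fetch a row and resolve its schema by ID."""
    schema_id, payload = rows[key]
    return json.loads(schemas[schema_id]), payload


user_schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"}],
}
sid = register_schema(user_schema)

# Two rows share one stored schema. The payloads are assumed to be
# Avro-encoded already (a length-prefixed string for the single field).
put("row1", sid, b"\x06bob")
put("row2", sid, b"\x0aalice")

schema, payload = get("row1")
```

The point of the indirection is that each row carries only a short ID, so the per-row overhead stays small no matter how large the schema is.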

HAvroBase is really pretty nice and has backends for storing data into
things other than HBase...

 - Bruce

On Mon, Nov 29, 2010 at 11:09 AM, Philip Zeyliger <philip@cloudera.com> wrote:

> Hi David,
> Your assessment of Thrift and Avro being isomorphic is correct, and
> you've correctly identified the major philosophical difference.  (It's
> in fact a little bit deeper than you suggest: at read time, there are
> always two schemas available: the reader's schema and the original
> schema that the data was written with.)
> Where are you storing the Avro records?  Avro's binary format for
> records is unlikely to change: it's pretty stable and changing would
> be a big deal.  On the other hand, Avro already has multiple ways for
> passing schema information along.  Avro's RPC implementations do one
> thing.  Avro Data Files store the schema in the header.  You could, in
> your system, always store (schema, data) tuples.  That's what Sam is
> doing in HAvroBase
> (
> http://www.javarants.com/2010/06/30/havrobase-a-searchable-evolvable-entity-store-on-top-of-hbase-and-solr/
> ).
> -- Philip
> On Sun, Nov 28, 2010 at 6:39 PM, David Jeske <davidj@gmail.com> wrote:
> > I have a storage project considering adding Thrift or Avro for record
> > packing, and I have a couple questions.
> > Other than type-id and field-ids, Avro and Thrift's designs seem
> > isomorphic. Is the binary format not including field-type-info something
> > that's set in stone, or something that's open for feedback?
> > I prefer the philosophy of Avro, namely to expect schemas to be available,
> > use those schemas for compatibility mapping, and to support dynamic schema
> > parsing in any supported language. In fact, being able to parse schemas
> > dynamically in any language is the real draw of Avro for me. (personally I'd
> > prefer if they were actually Avro IDL, instead of JSON, but I understand
> > that might complicate implementing client stubs).
> > However, the fact that data is not tagged with any type-information is
> > unacceptably dangerous for my application. There will be mechanisms for
> > mapping records to schemas, and schemas will be stored, but if a schema were
> > ever lost or corrupted, I can't afford for the data to turn into total junk.
> > Unless data is trivially small, encoding a field type wouldn't change the
> > size of the encoding much, but would provide some 'sanity checking' in
> > addition to being able to recover the raw data even if a schema was lost or
> > the schema ID for a piece of data was corrupted.
> > Since Avro is relatively new, I'm asking to find out if this is anathema to
> > the entire concept of Avro, or something that was chosen, but
> > might be reconsidered eventually.
> > Going the Thrift route for me will mean injecting a bit of the Avro
> > philosophy into Thrift, namely, adding a Thrift IDL parser to the language I
> > need, so I can save Thrift IDLs and then dynamically read them. However,
> > doing this as a one-off for my language is different than having a supported
> > mechanism for all client languages -- like in Avro.
> >
> >
