avro-user mailing list archives

From "Irving, Dave" <dave.irv...@baml.com>
Subject Specific/GenericDatumReader performance and resolving decoders
Date Thu, 19 Apr 2012 09:09:58 GMT
Hi,

Recently I've been looking at the performance of Avro's SpecificDatumReaders/Writers. In our
use cases, when deserializing, we find it quite usual for reader / writer schemas to be identical.
Interestingly, GenericDatumReader bakes the use of ResolvingDecoders right into its core,
so even if constructed with a single (reader/writer) schema, a ResolvingDecoder is still used.
I experimented a little, and wrote a SpecificDatumReader which, instead of being hard-wired
to a ResolvingDecoder, uses a DecodeStrategy - leaving the reader to deal only with plain
Decoders.
Details follow - but for 'same schema' decodes - the performance difference is impressive.
For the types of records I deal with, a decode with reader schema == writer schema using this
approach is about 1.6x faster than a standard SpecificDatumReader decode.


interface DecodeStrategy
{
  Decoder configureForRead(Decoder in) throws IOException;

  void readComplete() throws IOException;

  void decodeRecordFields(Object old, SpecificRecord record, Schema expected,
                          Decoder in, SpecificDatumReader2 reader) throws IOException;
}

The idea is that when we hit a record, instead of getting field order from a ResolvingDecoder
directly, we just let the decode strategy do it for us (calling back for each field to the
reader - allowing recursion).
For example, when we know reader / writer schemas are identical, and we don't want validation,
an IdentitySchemaDecodeStrategy#decodeRecordFields can just pull the fields directly from
the provided record schema (calling back on the reader for each one):

...

void decodeRecordFields(......)
{
  List<Field> fields = expected.getFields();
  for (int i = 0, len = fields.size(); i < len; ++i)
  {
    // Read each field in schema order, calling back into the reader
    reader.readField(old, in, fields.get(i), record);
  }
}

...

The resolving decoder impl of this strategy just does a 'readFieldOrder' like GenericDatumReader
does today.
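
To make the contrast between the two strategies concrete, here is a minimal, self-contained
sketch. It does not use the Avro API: a "schema" is modelled as a plain list of field names,
and FieldOrderSketch, FieldOrderStrategy, IdentityOrder and ResolvedOrder are hypothetical
names invented for illustration. The resolved variant is deliberately simplified (it ignores
defaults for fields the writer lacks and the skipping of writer-only data).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not Avro classes: a schema is just an ordered list of field names.
final class FieldOrderSketch {

    /** Decides the order in which a record's fields are read. */
    interface FieldOrderStrategy {
        List<String> fieldOrder(List<String> readerFields, List<String> writerFields);
    }

    /** Same-schema case: take the fields straight from the reader schema, no resolution. */
    static final class IdentityOrder implements FieldOrderStrategy {
        @Override
        public List<String> fieldOrder(List<String> readerFields, List<String> writerFields) {
            return readerFields;
        }
    }

    /**
     * Resolved case (simplified): the data arrives in writer order, so walk the
     * writer's fields and keep only those the reader also declares.
     */
    static final class ResolvedOrder implements FieldOrderStrategy {
        @Override
        public List<String> fieldOrder(List<String> readerFields, List<String> writerFields) {
            List<String> order = new ArrayList<>();
            for (String name : writerFields) {
                if (readerFields.contains(name)) {
                    order.add(name);
                }
            }
            return order;
        }
    }
}
```

The identity variant does no per-record work beyond returning the list it was given, which
is the kind of overhead the same-schema path avoids.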

For each read (given a Decoder), the datum reader lets the decode strategy return the
actual decoder to be used (via #configureForRead). This means that a resolving implementation
can use this hook to configure its ResolvingDecoder and return that instead.
The result is that the datum reader can work with same schema / validated schema / resolved
schemas seamlessly without caring about the difference.
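
The read lifecycle described above can be sketched as follows. This is a self-contained
illustration under stated assumptions, not the actual patch: Decoder, RawDecoder and
ResolvingWrapper are stand-in types rather than Avro's classes, the strategy methods omit
the IOException from the interface above for brevity, and the read method is a hypothetical
reduction of what the datum reader would do.

```java
import java.util.ArrayList;
import java.util.List;

final class ConfigureHookSketch {

    /** Stand-in for a decoder; not Avro's Decoder class. */
    interface Decoder {
        String name();
    }

    static final class RawDecoder implements Decoder {
        @Override
        public String name() { return "raw"; }
    }

    /** Wraps another decoder, much as a resolving decoder wraps the raw input. */
    static final class ResolvingWrapper implements Decoder {
        private Decoder in;
        void configure(Decoder in) { this.in = in; }
        @Override
        public String name() { return "resolving(" + in.name() + ")"; }
    }

    interface DecodeStrategy {
        Decoder configureForRead(Decoder in);
        void readComplete();
    }

    /** Same-schema case: hand the caller's decoder straight back. */
    static final class IdentityStrategy implements DecodeStrategy {
        @Override
        public Decoder configureForRead(Decoder in) { return in; }
        @Override
        public void readComplete() { }
    }

    /** Resolving case: configure the wrapper around the input and return the wrapper. */
    static final class ResolvingStrategy implements DecodeStrategy {
        private final ResolvingWrapper wrapper = new ResolvingWrapper();
        final List<String> events = new ArrayList<>();

        @Override
        public Decoder configureForRead(Decoder in) {
            wrapper.configure(in);
            events.add("configure");
            return wrapper;
        }

        @Override
        public void readComplete() {
            events.add("complete"); // a resolving implementation could finish up here
        }
    }

    /** The reader asks the strategy which decoder to use, reads, then signals completion. */
    static String read(DecodeStrategy strategy, Decoder in) {
        Decoder effective = strategy.configureForRead(in);
        try {
            return effective.name(); // a real reader would decode the datum here
        } finally {
            strategy.readComplete();
        }
    }
}
```

The reader code in read is identical for both strategies; only the decoder it is handed
differs.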

I thought I'd share the approach before working on a full patch: Is this an approach you'd
be interested in taking back to core avro? Or is it a little niche? :)

Cheers,

Dave
