avro-user mailing list archives

From Eelco Hillenius <eelco.hillen...@gmail.com>
Subject user experience
Date Tue, 25 Aug 2009 22:52:15 GMT

I'd like to share some results I'm getting with Avro. Just FYI :-)

We are using Avro to log 'audit events'. Audit events are basically
simple Java objects with a few properties that describe the audit
event. An example is class SiteNodeDeletedEvent with properties
timeStamp, userId and siteNodeId. Most event classes have between 3 and
8 properties. What I like about doing audit logging like this rather
than just logging string messages, is that it forces us to use data
structures which will be easier to analyze later, and that it will be
much easier to go through our code to find what kind of audit events
we have (all events must extend the AuditEvent base class). We
basically just use Avro to serialize these objects to rolling log
files locally, which are put into HDFS by a daemon separately. We use
Avro's reflection API so that we don't have to deal with code
generation and keep our development model as simple as we can.
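
To make the event model above concrete, here is a minimal sketch of what
such classes could look like. Only the names AuditEvent,
SiteNodeDeletedEvent, timeStamp, userId and siteNodeId come from the
description; the field types (long), the getters and the wrapper class
AuditEventSketch are assumptions for illustration. The idea is that a
plain class like this is what gets handed to Avro's reflection API, so no
generated code is needed.

```java
class AuditEventSketch {

    /** Base class that every audit event extends, so all event types are easy to find. */
    static abstract class AuditEvent {
        private long timeStamp; // assumed to be epoch milliseconds

        protected AuditEvent() {
            // no-arg constructor so the object can be re-created when reading
        }

        protected AuditEvent(long timeStamp) {
            this.timeStamp = timeStamp;
        }

        public long getTimeStamp() {
            return timeStamp;
        }
    }

    /** Example event: a user deleted a site node. */
    static class SiteNodeDeletedEvent extends AuditEvent {
        private long userId;     // type is an assumption
        private long siteNodeId; // type is an assumption

        SiteNodeDeletedEvent() {
        }

        SiteNodeDeletedEvent(long timeStamp, long userId, long siteNodeId) {
            super(timeStamp);
            this.userId = userId;
            this.siteNodeId = siteNodeId;
        }

        public long getUserId() {
            return userId;
        }

        public long getSiteNodeId() {
            return siteNodeId;
        }
    }

    public static void main(String[] args) {
        // Arbitrary example values.
        AuditEvent event = new SiteNodeDeletedEvent(System.currentTimeMillis(), 42L, 7L);
        System.out.println(event.getClass().getSimpleName() + " at " + event.getTimeStamp());
    }
}
```

Because every event extends AuditEvent, finding all event types is a
matter of looking up the subclasses of one class.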

Currently we write only eight different events to a database, and this
so far has resulted in a bit over 12 million records. However, I hope
to ramp up what we log, so I expect we will soon have trillions of
records. I'd much rather buy more disk space than have to worry
about scaling our database, and I think audit logging is a natural
fit for HDFS/MR. And while I'm at it, why not make the logging
itself efficient? That is where Avro comes into play.

I wrote a little framework for logging these events, and tested that
with our current records. In that test, I roll over each file after a
million records, so I end up with 13 files (the last file holds only a
quarter million), totaling 121 MB unpacked/36 MB gzipped (the framework
typically gzips right after rolling over). So that's about 10 MB
unpacked/3 MB packed per million records. Writing those files,
including reading the records from a local MySQL database and
instantiating the event objects, takes 4.5 minutes on my MBP. Reading
in and instantiating those events from the log files again takes 1.3
minutes.
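
The roll-over scheme described above (start a new file after a fixed
number of records, gzip the finished one) can be sketched roughly as
follows. This is not the framework from this message: the class name
RollingLogSketch, the file naming and the raw byte[] records are all
assumptions, and it uses only the JDK, so the records are written as
plain bytes rather than in Avro's encoding.

```java
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import java.util.zip.GZIPOutputStream;

class RollingLogSketch implements Closeable {
    private final Path dir;
    private final long maxRecordsPerFile;
    private OutputStream out;      // null when no file is currently open
    private Path current;          // file currently being written
    private long recordsInCurrent;
    private int fileIndex;

    RollingLogSketch(Path dir, long maxRecordsPerFile) {
        this.dir = dir;
        this.maxRecordsPerFile = maxRecordsPerFile;
    }

    /** Appends one already-serialized record, rolling over when the file is full. */
    void append(byte[] record) throws IOException {
        if (out == null) {
            current = dir.resolve(String.format("audit-%05d.log", fileIndex++));
            out = new BufferedOutputStream(Files.newOutputStream(current));
            recordsInCurrent = 0;
        }
        out.write(record);
        recordsInCurrent++;
        if (recordsInCurrent >= maxRecordsPerFile) {
            roll();
        }
    }

    /** Closes the current file, gzips it next to the original and deletes the original. */
    private void roll() throws IOException {
        out.close();
        out = null;
        Path gz = current.resolveSibling(current.getFileName() + ".gz");
        try (OutputStream zipped = new GZIPOutputStream(Files.newOutputStream(gz))) {
            Files.copy(current, zipped);
        }
        Files.delete(current);
    }

    /** Rolls the last, partially filled file so nothing is left unpacked. */
    @Override
    public void close() throws IOException {
        if (out != null) {
            roll();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("audit-logs");
        // Scaled-down version of the test: roll after 4 records, write 13.
        try (RollingLogSketch log = new RollingLogSketch(dir, 4)) {
            for (int i = 0; i < 13; i++) {
                log.append(("event-" + i + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
        try (Stream<Path> files = Files.list(dir)) {
            files.sorted().forEach(System.out::println);
        }
    }
}
```

Gzipping the last partial file on close is a choice made for the sketch;
a daemon that ships files to HDFS might prefer to leave the active file
alone until it is full.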

In my book, those are pretty good figures for my humble laptop! And
keep in mind that I am using the reflection API; using specific
records could probably shave quite a bit off the processing time, at
least when it comes to writing. Anyway, I'm sure I won't have any
trouble selling Avro to my colleagues, and I just wanted to share my
experiences in case anyone would be interested. It'd be awesome to
read others' experiences as well. Now on to playing with MR and Pig
etc. :-)
