avro-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: improve performance of avro map reduce jobs
Date Thu, 30 Jun 2011 17:43:56 GMT

A log file of the task can help identify whether most of the time is spent mapping or sorting.  I suspect
that the comparisons during sorting are what is taking a while, but it could be something else.

Even very crude profiling can be useful.  This can be done with 'kill -3' or 'jstack' to
get a dozen or so stack traces of a task while a job is running; if most stacks have the
M/R thread in the same place, it is a likely bottleneck.
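
As a rough illustration of the stack-sampling idea above, here is a minimal Python sketch that tallies the innermost frame of one thread across several 'jstack' dumps. The helper names (top_frames, tally, collect_dumps) are hypothetical, the thread name "main" is an assumption about where the map/reduce work runs in the task JVM, and the parsing assumes the usual jstack layout of a quoted thread name followed by "at ..." frame lines.

```python
import subprocess
import time
from collections import Counter


def top_frames(dump_text, thread_name="main"):
    """Return the innermost frame of each thread named exactly thread_name."""
    frames = []
    in_target = False
    for line in dump_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"'):
            # Thread header line, e.g.: "main" prio=10 tid=0x... runnable
            name = stripped.split('"')[1]
            in_target = (name == thread_name)
        elif in_target and stripped.startswith("at "):
            frames.append(stripped[3:])
            in_target = False  # keep only the innermost frame of this thread
    return frames


def tally(dumps, thread_name="main"):
    """Count how often each innermost frame appears across the dumps."""
    counts = Counter()
    for dump in dumps:
        counts.update(top_frames(dump, thread_name))
    return counts


def collect_dumps(pid, samples=12, interval_s=1.0):
    """Run 'jstack <pid>' repeatedly and return the raw dump texts."""
    dumps = []
    for _ in range(samples):
        result = subprocess.run(
            ["jstack", str(pid)], capture_output=True, text=True, check=True
        )
        dumps.append(result.stdout)
        time.sleep(interval_s)
    return dumps
```

Calling tally(collect_dumps(pid)).most_common(5) on a running task's process id would list the frames seen most often; if one frame dominates the samples, that is the likely bottleneck.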

On 6/26/11 6:25 AM, "ey-chih chow" <eychih@hotmail.com> wrote:

I have not done a thorough investigation, but from what I was told, the amount of network
data pushed among the nodes of our cluster is very low.  Most of the time seems to be local

Date: Sat, 25 Jun 2011 07:27:17 -0700
Subject: Re: improve performance of avro map reduce jobs
From: rbodkin@thinkbiganalytics.com
To: user@avro.apache.org

To me, Avro offers a big benefit to MapReduce jobs by having a well-defined file format that
stores the schema that was used to write the records, is splittable and compressible, and
has metadata. Ultimately, I'd like to see that and a binding layer on top of data serialization,
and more flexibility for serialization in Avro (e.g., why not be able to use its APIs but
Protobuf for binding?).

That being said, I'm curious where you see the CPU going in your jobs? Where in the Avro serialization
is it spending its time? I suspect that GenericData isn't nearly as performant as using codegen.

On Sat, Jun 25, 2011 at 6:27 AM, ey-chih chow <eychih@hotmail.com> wrote:
Let me put the question another way.  Companies like Twitter use Protocol Buffers as
their serialization tool.  It seems to have better performance.  Is there anything compelling
that Avro can do and Protocol Buffers cannot?  Thanks.


From: eychih@hotmail.com
To: user@avro.apache.org
Subject: improve performance of avro map reduce jobs
Date: Fri, 24 Jun 2011 16:55:58 -0700

Our Map/Reduce jobs are all based on avro.  We would like to enhance their performance.  The
objects collected in our mappers and reducers are mainly of the type GenericData.Record. 
Currently, most of our jobs are CPU bound rather than IO bound.  Can anybody suggest ways to improve
the performance of the jobs?  Thanks a lot.

Ey-Chih Chow


Ron Bodkin
Think Big Analytics<http://www.thinkbiganalytics.com>
m: +1 (415) 509-2895
