hadoop-general mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Q about use of avro logs with hadoop streaming
Date Tue, 06 Apr 2010 18:32:16 GMT
Mona Gandhi wrote:
> I currently use Avro version 1.3.0 to log data. I am having difficulty processing these
> Avro logs via a map reduce job written in Python using hadoop streaming (v 0.21.0).


There is currently no support for Avro data in streaming.  One could use 
a shell command to convert Avro data to lines of text (e.g., Avro's 
'tojson' tool) but that would be rather inefficient.
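
As a rough illustration of that text-based workaround, here is a minimal sketch of a streaming mapper in Python. It assumes the Avro records have already been converted to one JSON object per line (for example by the 'tojson' tool). The field name "user" and the function names are hypothetical, chosen only for the example.

```python
import json
import sys


def map_lines(lines):
    """Yield one tab-separated key/value string per JSON record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # "user" is a hypothetical field name, not part of any real schema.
        key = record.get("user", "unknown")
        yield "%s\t%d" % (key, 1)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read JSON lines from stdin and write key<TAB>value lines to stdout."""
    for out in map_lines(stdin):
        stdout.write(out + "\n")
```

A real mapper script would end with a call to main(). Every record is rendered as text and parsed again, which is the inefficiency noted above.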

A good approach would be something akin to Hadoop Pipes: we implement a 
Java mapper and reducer that use an Avro protocol to communicate with a 
subprocess over standard input and output, transmitting input and output 
records as raw binary.  The subprocess would deserialize inputs, call 
the user-provided mapper or reducer function, then serialize outputs 
back.  This would require no changes to Hadoop and could be included in 
Avro.  We'd provide implementations of this protocol for the various 
languages, Python, Ruby, C, C++, etc., enabling high-performance 
mapreduce programs over Avro data for all of these.
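
The subprocess side of such a design might look roughly like the Python sketch below. It is not the Avro protocol itself. It assumes, purely for illustration, that each record is framed by a 4-byte big-endian length prefix, and the deserialize and serialize callables stand in for Avro binary decoding and encoding. All function names here are hypothetical.

```python
import io
import struct


def read_frames(stream):
    """Yield byte strings framed by a 4-byte big-endian length prefix."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of input
        (size,) = struct.unpack(">I", header)
        payload = stream.read(size)
        if len(payload) < size:
            raise EOFError("truncated record")
        yield payload


def write_frame(stream, payload):
    """Write one length-prefixed binary record."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def run_task(user_fn, deserialize, serialize, in_stream, out_stream):
    """Deserialize each input, call the user function, serialize its outputs."""
    for raw in read_frames(in_stream):
        for result in user_fn(deserialize(raw)):
            write_frame(out_stream, serialize(result))
    out_stream.flush()
```

In a real task the streams would be the binary standard input and output (sys.stdin.buffer and sys.stdout.buffer in Python 3), with the Java mapper or reducer on the other end of the pipe. Records stay binary end to end, so no text conversion is needed.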

The existing Hadoop Pipes implementation would be a good starting point 
for this work, as it already uses the same technique, although with a 
Hadoop Writable-based protocol and with only a C++ implementation.

I've filed an issue in Jira to track this:


I might have a chance to work on this later this month.

