Mailing-List: contact general-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@hadoop.apache.org
Message-ID: <4BBB7E30.8040602@apache.org>
Date: Tue, 06 Apr 2010 11:32:16 -0700
From: Doug Cutting <cutting@apache.org>
To: general@hadoop.apache.org
Subject: Re: Q about use of avro logs with hadoop streaming
References: <859D76D2-50AE-4E66-974A-3B47621323F6@apture.com>
In-Reply-To: <859D76D2-50AE-4E66-974A-3B47621323F6@apture.com>

Mona Gandhi wrote:
> I currently use Avro version 1.3.0 to log data. I am having difficulty processing these Avro logs via a MapReduce job written in Python using Hadoop streaming (v0.21.0).

Mona,

There is currently no support for Avro data in streaming.  One could use 
a shell command to convert Avro data to lines of text (e.g., Avro's
'tojson' tool), but that would be rather inefficient, since every
record is re-encoded as text and then parsed again by the script.
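
For illustration, the streaming side of that text-based workaround might
look roughly like the sketch below.  It assumes the input has already
been converted to one JSON record per line (as 'tojson' emits), and the
field name "url" is hypothetical, not taken from any real schema.

```python
import json


def map_lines(lines, key_field="url"):
    """Yield (key, 1) per JSON record; key_field is a made-up field name."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        record = json.loads(line)
        yield record.get(key_field, "unknown"), 1


def run(stdin, stdout):
    # Hadoop streaming reads tab-separated key/value pairs, one per line.
    for key, count in map_lines(stdin):
        stdout.write("%s\t%d\n" % (key, count))
```

Used as a streaming mapper, the script would end with a call such as
run(sys.stdin, sys.stdout) after importing sys.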

A good approach would be something akin to Hadoop Pipes: we implement a 
Java mapper and reducer that use an Avro protocol to communicate with a 
subprocess over standard input and output, transmitting input and output 
records as raw binary.  The subprocess would deserialize inputs, call 
the user-provided mapper or reducer function, then serialize outputs 
back.  This would require no changes to Hadoop and could be included in 
Avro.  We'd provide implementations of this protocol for various
languages (Python, Ruby, C, C++, etc.), enabling high-performance
MapReduce programs over Avro data in all of them.
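
A rough Python sketch of what the subprocess side of such a scheme could
look like follows.  Everything in it is an assumption made for
illustration: the 4-byte length-prefixed framing, the function names,
and the use of JSON as a stand-in for Avro's binary encoding.  It is not
the protocol tracked in the Jira issue, only the general shape of
"deserialize, call the user function, serialize back".

```python
import json
import struct


def read_frame(stream):
    """Read one length-prefixed frame; return None at end of input."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack(">I", header)
    return stream.read(length)


def write_frame(stream, payload):
    """Write a 4-byte big-endian length followed by the payload bytes."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def decode(payload):
    # Stand-in for Avro binary decoding (assumption: JSON, for brevity).
    return json.loads(payload.decode("utf-8"))


def encode(record):
    # Stand-in for Avro binary encoding.
    return json.dumps(record, sort_keys=True).encode("utf-8")


def serve(user_map, stdin, stdout):
    """Deserialize each input, call the user's mapper, serialize outputs."""
    while True:
        payload = read_frame(stdin)
        if payload is None:
            break
        for output in user_map(decode(payload)):
            write_frame(stdout, encode(output))
    stdout.flush()


def word_count_map(record):
    # Hypothetical user-provided mapper: one (word, 1) pair per word.
    for word in record["text"].split():
        yield {"key": word, "value": 1}
```

In a real subprocess the loop would be started with binary streams, e.g.
serve(word_count_map, sys.stdin.buffer, sys.stdout.buffer), with the
Java mapper on the other end of the pipe writing and reading the frames.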

The existing Hadoop Pipes implementation would be a good starting point 
for this work, as it already uses the same technique, although with a 
Hadoop Writable-based protocol and with only a C++ implementation.

I've filed an issue in Jira to track this:

https://issues.apache.org/jira/browse/AVRO-512

I might have a chance to work on this later this month.

Doug