From: Doug Cutting
Date: Tue, 06 Apr 2010 11:32:16 -0700
To: general@hadoop.apache.org
Subject: Re: Q about use of avro logs with hadoop streaming
Message-ID: <4BBB7E30.8040602@apache.org>
In-Reply-To: <859D76D2-50AE-4E66-974A-3B47621323F6@apture.com>

Mona Gandhi wrote:
> i currently use avro version 1.3.0 to log data.
> I am having difficulty processing these avro logs via a map reduce job written in Python using hadoop streaming(v 0.21.0).

Mona,

There is currently no support for Avro data in streaming. One could use a shell command to convert Avro data to lines of text (e.g., Avro's 'tojson' tool) but that would be rather inefficient.

A good approach would be something akin to Hadoop Pipes: we implement a Java mapper and reducer that use an Avro protocol to communicate with a subprocess over standard input and output, transmitting input and output records as raw binary. The subprocess would deserialize inputs, call the user-provided mapper or reducer function, then serialize outputs back. This would require no changes to Hadoop and could be included in Avro. We'd provide implementations of this protocol for the various languages, Python, Ruby, C, C++, etc., enabling high-performance mapreduce programs over Avro data for all of these.

The existing Hadoop Pipes implementation would be a good starting point for this work, as it already uses the same technique, although with a Hadoop Writable-based protocol and with only a C++ implementation.

I've filed an issue in Jira to track this:

https://issues.apache.org/jira/browse/AVRO-512

I might have a chance to work on this later this month.

Doug
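
For illustration only, the subprocess side of the approach described above might look roughly like the following Python sketch. Everything in it is an assumption made for the example: the 4-byte length-prefixed framing, the use of JSON as a stand-in for Avro binary decoding, and the names read_records, write_record, run_mapper and word_count_mapper are all hypothetical. None of this is the protocol proposed in AVRO-512, which had not been written at the time of this message.

```python
import json
import struct

# Hypothetical framing: each record is a 4-byte big-endian length followed by
# that many payload bytes. This stands in for the Avro protocol; it is not it.
HEADER = struct.Struct(">I")


def read_records(stream):
    """Yield raw record payloads until the stream is exhausted."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of input
        if len(header) < HEADER.size:
            raise EOFError("truncated record header")
        (length,) = HEADER.unpack(header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated record payload")
        yield payload


def write_record(stream, payload):
    """Write one length-prefixed record to a binary stream."""
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)


def run_mapper(mapper, in_stream, out_stream):
    """Deserialize each input, call the user's mapper, serialize its outputs."""
    for payload in read_records(in_stream):
        # JSON is a stand-in here; a real version would decode Avro binary.
        record = json.loads(payload.decode("utf-8"))
        for out in mapper(record):
            write_record(out_stream, json.dumps(out).encode("utf-8"))
    out_stream.flush()


def word_count_mapper(record):
    """Example user-provided mapper: emit [word, 1] for each word in 'text'."""
    for word in record["text"].split():
        yield [word, 1]
```

Run as a subprocess, this would be driven with something like run_mapper(word_count_mapper, sys.stdin.buffer, sys.stdout.buffer), with the Java mapper on the other end of the pipe writing and reading the same framing. The user only supplies the mapper function; the framing and serialization stay in the library.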