From: Luca Pireddu
To: user@hadoop.apache.org
Date: Mon, 19 Nov 2012 15:08:01 +0100
Subject: Re: Pydoop 0.7.0-rc1 released

On 11/16/2012 10:02 PM, Bart Verwilst wrote:
> Hi Simone,
>
> I was wondering, is it possible to write Avro files to Hadoop straight
> from your lib (mixed with the avro libs, of course)? I'm currently trying
> to come up with a way to read from MySQL (but more complicated than Sqoop
> can handle) and write it out to Avro files on HDFS. Is something like
> this feasible with Pydoop? How would you see it working?
>
> Thanks!
>
> Bart

Hello,

you could use a record writer that uses the python-avro package
(http://pypi.python.org/pypi/avro/1.7.2); unfortunately, I've seen a few
complaints about its speed. For an example of a RecordWriter implemented
in Python, see the wordcount-full program in the Pydoop examples. If that
solution turns out to be too slow for you, you may want to consider writing
a Java record writer that uses the standard Avro implementation.

In either case, you'll have to get the data from your reducers to the
record writer. Pydoop only supports emitting byte streams, so you'll have
to serialize your data as a string of some sort, pass it to Pydoop, receive
it in the RecordWriter, de-serialize it there, and then hand it to the Avro
library (a rough sketch is included below my signature).

-- 
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
09010 Pula (CA), Italy
Tel: +39 0709250452
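
Here is a minimal, untested sketch of what I mean, assuming the pipes API
from Pydoop 0.7 (pydoop.pipes) plus python-avro. The schema, the AvroWriter
class name, the JSON serialization, the output path logic and the job conf
keys are only illustrative, and I'm assuming DataFileWriter accepts the
file-like object returned by pydoop.hdfs.open(); check the wordcount-full
example for the exact RecordWriter hooks in your Pydoop version.

import json

import avro.datafile
import avro.io
import avro.schema
import pydoop.hdfs as hdfs
import pydoop.pipes as pp

# Hypothetical schema -- replace with one that matches your MySQL rows.
SCHEMA_JSON = json.dumps({
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "payload", "type": "string"},
    ],
})


class AvroWriter(pp.RecordWriter):
    """Receive JSON-serialized records from the reducer and append them
    to an Avro container file on HDFS."""

    def __init__(self, context):
        super(AvroWriter, self).__init__(context)
        jc = context.getJobConf()
        out_dir = jc.get("mapred.work.output.dir")
        part = jc.getInt("mapred.task.partition")
        # One Avro file per reduce task.
        self.hdfs_file = hdfs.open("%s/part-%05d.avro" % (out_dir, part), "w")
        self.writer = avro.datafile.DataFileWriter(
            self.hdfs_file, avro.io.DatumWriter(),
            avro.schema.parse(SCHEMA_JSON))

    def emit(self, key, value):
        # The reducer emitted value = json.dumps(record_dict); rebuild the
        # dict and hand it to the Avro library.
        self.writer.append(json.loads(value))

    def close(self):
        self.writer.close()  # also closes the underlying HDFS file

On the reducer side you would then emit something like
context.emit("", json.dumps(record)) for each row, and register the writer
with your factory (e.g. pp.Factory(Mapper, Reducer,
record_writer_class=AvroWriter) -- again, check wordcount-full for the exact
factory arguments in your version).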