hadoop-hdfs-user mailing list archives

From Luca Pireddu <pire...@crs4.it>
Subject Re: Pydoop 0.7.0-rc1 released
Date Mon, 19 Nov 2012 14:08:01 GMT
On 11/16/2012 10:02 PM, Bart Verwilst wrote:
> Hi Simone,
> I was wondering, is it possible to write AVRO files to hadoop straight
> from your lib ( mixed with avro libs ofcourse )? I'm currently trying to
> come up with a way to read from mysql ( but more complicated than sqoop
> can handle ) and write it out to avro files on HDFS. Is something like
> this feasible with this? How do you see it?
> Thanks!
> Bart


You could use a record writer that uses the python-avro package 
(http://pypi.python.org/pypi/avro/1.7.2).  Unfortunately I've seen a few 
complaints about its speed.  For an example of a RecordWriter 
implemented in Python see wordcount-full in the Pydoop examples.

If that solution turns out to be too slow for you, you may consider 
writing a Java record writer that uses the standard Avro implementation.

In either case, you'll have to get the data from your reducers to the 
record writer.  Pydoop only supports emitting byte streams, so you'll 
have to serialize your data as a string of some sort, pass it to pydoop, 
receive it in the RecordWriter where you'll de-serialize it and then 
pass it to the Avro library.
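
As a rough sketch of that round trip, assuming JSON as the string encoding (any other text serialization would do), the two ends could look something like the following. The function names and the sample row are hypothetical and only for illustration; they are not part of the Pydoop or Avro APIs, and the actual hand-off to Pydoop and to the Avro writer is left out.

```python
import json


def serialize_record(record):
    """Reducer side (hypothetical helper): encode a dict as a string.

    json.dumps escapes newlines inside string values, so the result
    is always a single line of text.
    """
    return json.dumps(record, sort_keys=True)


def deserialize_record(value):
    """Record writer side (hypothetical helper): rebuild the dict."""
    return json.loads(value)


# Example row, e.g. something read from MySQL (made-up values).
row = {"id": 42, "name": "example", "price": 9.5}

# In the reducer: this string is what you would emit as the value.
wire_value = serialize_record(row)

# In the RecordWriter: decode the value, then pass the resulting dict
# to the Avro library (that step is not shown here).
restored = deserialize_record(wire_value)
```

JSON keeps the example short, but it only round-trips basic types (strings, numbers, booleans, lists, dicts); dates or binary columns would need an explicit conversion on both sides.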

Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
09010 Pula (CA), Italy
Tel: +39 0709250452
