hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Harsh J <ha...@cloudera.com>
Subject Re: mapreduce and python
Date Tue, 21 Jun 2011 05:45:40 GMT
I'd like to +1 to using Dumbo for all things Python and Hadoop
MapReduce. Its one of the better ways to do things.

Do look at the initial conversation here:
http://old.nabble.com/hadoop-streaming-binary-input---image-processing-td23544344.html
as well.

The feature/bug fixes specified in the post are present in Apache
Hadoop 0.21 (which isn't deemed to be suited for production use yet)
and is also available in other (in-production-use) Hadoop
distributions such as Cloudera's, which is based off on 0.20.2:
https://ccp.cloudera.com/display/SUPPORT/Downloads

On Tue, Jun 21, 2011 at 10:43 AM, Jeremy Lewi <jeremy@lewi.us> wrote:
> Hassen,
>
> I've been very succesful using Hadoop Streaming, Dumbo, and TypedBytes
> as a solution for using python to implement mappers and reducers.
>
> TypedBytes is a hadoop encoding format that allows binary data
> (including lists and maps) to be encoded in a format that permits the
> serialized data to safely be passed to mappers/reducers via the command
> line through hadoop streaming.
>
> Dumbo is a python library which makes it easy to implement your mappers
> and reducers in python. In particular, it handles decoding the data
> encoded as typedbytes to native python types.
>
> J
> On Mon, 2011-06-20 at 21:05 -0400, Joe Stein wrote:
>> Hassen,
>>
>>
>> I have lots of binary data that I parse using Python streaming.
>>
>>
>> The way I do this is stream the binary data into sequence files (the
>> binary data object I save in the key and (null) as the value).
>>
>>
>> Each key then gets written back to me line by line, key by key for an
>> entire block when streaming.
>>
>>
>> To have this work in streaming on the command line you need to
>> use -inputformat SequenceFileAsTextInputFormat
>>
>>
>> To create the sequence files I have a jar file that goes from
>> BufferedReader and writes to org.apache.hadoop.io.SequenceFile.Writer
>>
>>
>> I am not sure if you can do this for your data but if not then make
>> your own InputFormat.
>>
>>
>> good luck!
>>
>>
>> /*
>> Joe Stein
>> http://www.linkedin.com/in/charmalloc
>> Twitter: @allthingshadoop
>> */
>>
>> On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi <hassen.riahi@cern.ch>
>> wrote:
>>         Dear all,
>>
>>         Is it possible to have a binary input to a map code written in
>>         python?
>>
>>         Thank you
>>         Hassen
>>
>>
>>
>
>



-- 
Harsh J

Mime
View raw message