hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hassen Riahi <hassen.ri...@cern.ch>
Subject Re: mapreduce and python
Date Wed, 22 Jun 2011 13:18:06 GMT
I'm trying these solutions...Thanks for suggestions.

> I'd like to +1 to using Dumbo for all things Python and Hadoop
> MapReduce. Its one of the better ways to do things.
>
> Do look at the initial conversation here:
> http://old.nabble.com/hadoop-streaming-binary-input---image-processing-td23544344.html
> as well.
>
> The feature/bug fixes specified in the post are present in Apache
> Hadoop 0.21 (which isn't deemed to be suited for production use yet)
> and is also available in other (in-production-use) Hadoop
> distributions such as Cloudera's, which is based off on 0.20.2:
> https://ccp.cloudera.com/display/SUPPORT/Downloads
>
> On Tue, Jun 21, 2011 at 10:43 AM, Jeremy Lewi <jeremy@lewi.us> wrote:
>> Hassen,
>>
>> I've been very succesful using Hadoop Streaming, Dumbo, and  
>> TypedBytes
>> as a solution for using python to implement mappers and reducers.
>>
>> TypedBytes is a hadoop encoding format that allows binary data
>> (including lists and maps) to be encoded in a format that permits the
>> serialized data to safely be passed to mappers/reducers via the  
>> command
>> line through hadoop streaming.
>>
>> Dumbo is a python library which makes it easy to implement your  
>> mappers
>> and reducers in python. In particular, it handles decoding the data
>> encoded as typedbytes to native python types.
>>
>> J
>> On Mon, 2011-06-20 at 21:05 -0400, Joe Stein wrote:
>>> Hassen,
>>>
>>>
>>> I have lots of binary data that I parse using Python streaming.
>>>
>>>
>>> The way I do this is stream the binary data into sequence files (the
>>> binary data object I save in the key and (null) as the value).
>>>
>>>
>>> Each key then gets written back to me line by line, key by key for  
>>> an
>>> entire block when streaming.
>>>
>>>
>>> To have this work in streaming on the command line you need to
>>> use -inputformat SequenceFileAsTextInputFormat
>>>
>>>
>>> To create the sequence files I have a jar file that goes from
>>> BufferedReader and writes to  
>>> org.apache.hadoop.io.SequenceFile.Writer
>>>
>>>
>>> I am not sure if you can do this for your data but if not then make
>>> your own InputFormat.
>>>
>>>
>>> good luck!
>>>
>>>
>>> /*
>>> Joe Stein
>>> http://www.linkedin.com/in/charmalloc
>>> Twitter: @allthingshadoop
>>> */
>>>
>>> On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi <hassen.riahi@cern.ch>
>>> wrote:
>>>         Dear all,
>>>
>>>         Is it possible to have a binary input to a map code  
>>> written in
>>>         python?
>>>
>>>         Thank you
>>>         Hassen
>>>
>>>
>>>
>>
>>
>
>
>
> -- 
> Harsh J


Mime
View raw message