hadoop-mapreduce-user mailing list archives

From Mapred Learn <mapred.le...@gmail.com>
Subject Re: Sequence file format in python and serialization
Date Fri, 03 Jun 2011 01:07:37 GMT
Thanks Jeremy,
I will look into the details you provided.

Sent from my iPhone

On Jun 2, 2011, at 6:12 AM, Jeremy Lewi <jeremy@lewi.us> wrote:

> JJ
> 
> If you want to use complex types in a streaming job, I think you need to
> encode the values using the typedbytes format within the sequence file;
> i.e., the key and value in the sequence file are both TypedBytesWritable.
> This is independent of the language the mapper and reducer are written
> in, because the values need to be encoded as a byte stream in such a way
> that the binary stream doesn't contain any characters that would cause
> problems when passed in via stdin/stdout.
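> 
> As a rough illustration, the streaming invocation would look something
> like the one below. The input/output paths and script names are made
> up, the jar location varies by install, and the -io typedbytes option
> requires a streaming jar recent enough to support it:
> 
>     hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
>       -io typedbytes \
>       -inputformat org.apache.hadoop.mapred.SequenceFileInputFormat \
>       -input /user/jj/data.seq \
>       -output /user/jj/out \
>       -mapper mapper.py \
>       -reducer reducer.py \
>       -file mapper.py -file reducer.py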
> 
> In Python, your mapper/reducer will pull in bytes from stdin, which can
> be decoded from typedbytes into native Python types.
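> 
> A minimal identity-mapper sketch using the typedbytes Python package
> (also by the dumbo author, https://github.com/klbostee/typedbytes);
> the PairedInput/PairedOutput API shown here is from that project's
> README, quoted from memory:
> 
>     import sys
>     import typedbytes  # assumed installed, e.g. via easy_install
> 
>     # Streaming feeds (key, value) pairs as typedbytes on stdin and
>     # expects typedbytes-encoded pairs back on stdout.
>     tb_in = typedbytes.PairedInput(sys.stdin)
>     tb_out = typedbytes.PairedOutput(sys.stdout)
> 
>     for key, value in tb_in.reads():
>         # key/value are now native Python types (ints, strings,
>         # lists, dicts, ...); replace the identity below with real logic.
>         tb_out.write((key, value))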
> 
> The easiest way to do this is to use dumbo
> (https://github.com/klbostee/dumbo/wiki) to write your Python
> mapper/reducer. The dumbo module handles the serialization and
> deserialization between typedbytes and native Python types.
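> 
> For example, a word-count-style dumbo job looks roughly like this
> (an untested sketch based on the dumbo wiki tutorial):
> 
>     def mapper(key, value):
>         # key/value arrive as native Python types, already decoded
>         # from typedbytes by dumbo.
>         for word in value.split():
>             yield word, 1
> 
>     def reducer(key, values):
>         yield key, sum(values)
> 
>     if __name__ == "__main__":
>         import dumbo
>         dumbo.run(mapper, reducer)
> 
> You would then launch it with something like (file names made up):
> dumbo start wordcount.py -hadoop $HADOOP_HOME -input data.seq -output out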
> 
> J
> 
> On Thu, 2011-06-02 at 00:06 -0700, Mapred Learn wrote:
>> Hi,
>> I have a question about using the sequence file input format in the
>> hadoop streaming jar with mappers and reducers written in Python.
>> 
>> If I use sequence file as the input format for the streaming jar and
>> use mappers written in Python, can I take care of serialization and
>> de-serialization in the mapper/reducer code? For example, if I have
>> complex data types in the sequence file's values, can I de-serialize
>> them in Python and run a map-reduce job using the streaming jar?
>> 
>> Thanks in advance,
>> -JJ
> 
