hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
Date Wed, 03 Oct 2007 22:03:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532269

Joydeep Sen Sarma commented on HADOOP-1986:

i have been working on putting thrift structs into Hdfs. I have been a happy camper so far
(at least as far as hadoop/hdfs are concerned). Just for reference - this is what it ended
up looking like:

- use BytesWritable to wrap thrift structs (and store the same in sequencefiles)
- for writing structs - i haven't had to allocate TTransport and TProtocol objects everytime.
Resetting the buffer in a ByteArrayOutputStream works. i expect similar strategy to work for
reading (might need to extend ByteArrayInputStream)
- as far as invoking the right serializer/deserializer - it's easy to do this with Reflection:
   * When loading data into  hdfs - the name of the thrift class is encoded before the serialized
struct. (this is a property of TTransport). The function signatures for serialize/deserialize
are constant allowing easy use of reflection
   * Data is loaded into hdfs in a way that also allows us to know the class-name for any
serialized struct - so again we use reflection to deserialize while processing. (There are
different ways of arranging this).

Of course - if the data is homogenous - then reflection is not required. But that's not the
case with our data set. 

I haven't found the Writable interface to be a serious constraint in any way. There are some
inefficiencies in the above process:
1. the use of extra length field from using BytesWritable
2. the use of reflection (don't know what the overhead's like)

But i am not sure these are big burdens. (I don't even understand how to conceivably avoid

One of the good things about Thrift is cross-language code generation. What we would do at
some point is allow Python (or perl) code to work on Binary serialized data. The Streaming
library already seems to allow this (will pass BytesWritable key,values to external map-reduce
handlers) - where the byte array can be deserialized by thrift generated python deserializer
in the python mapper.

As far as the discussions on this Thread go - i am not sure what the final proposal is. One
thing i would opine is that the map-red job is best placed to understand what serialization
library to invoke (instead of the map-reduce infrastructure). For example - we are segregating
different data types in different files - and the thrift class is implicit in the path name
(which is made available as a key). If i understand it correctly - Pig takes a similar stance
(input path is mapped to a serialization library). 

> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>                 Key: HADOOP-1986
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Tom White
>             Fix For: 0.16.0
> Currently Map Reduce programs have to use WritableComparable-Writable key-value pairs.
While it's possible to write Writable wrappers for other serialization frameworks (such as
Thrift), this is not very convenient: it would be nicer to be able to use arbitrary types
directly, without explicit wrapping and unwrapping.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message