hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
Date Mon, 08 Oct 2007 02:21:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533022
] 

Owen O'Malley commented on HADOOP-1986:
---------------------------------------

Vivek,
   No one was suggesting a serializer per concrete class, except in the case of Thrift if
they don't implement a generic interface. Your proposal doesn't address how the mapping from
an Object to a Serializer is managed. I think my suggestion provides the most flexibility, since
you only need one serializer per root class and there are no requirements on the
implementation classes at all. Basically, each serialization library that someone wanted to
use with Hadoop would have a single generic serializer, and a library routine would do the
lookups at the first level:

{code}
public interface Serializer<T> {
  // Write t to the stream.
  void serialize(T t, OutputStream out) throws IOException;
  // Populate the supplied instance from the stream.
  void deserialize(T t, InputStream in) throws IOException;
  // Get the base class that this serializer will work on.
  Class<T> getTargetClass();
}
{code}

org.apache.hadoop.io.serializer.WritableSerializer would be coded to read and write any Writable,
while org.apache.hadoop.io.serializer.ThriftSerializer would read and write any Thrift type.
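As a rough sketch of how that would look for Writables (the Writable and Serializer interfaces are stubbed inline here so the example stands alone; only the delegation pattern is the point):

```java
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Stub of org.apache.hadoop.io.Writable so the sketch compiles on its own.
interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// The Serializer interface proposed above.
interface Serializer<T> {
  void serialize(T t, OutputStream out) throws IOException;
  void deserialize(T t, InputStream in) throws IOException;
  Class<T> getTargetClass();
}

// One generic serializer covers every Writable: it just delegates to the
// object's own write/readFields methods, so implementation classes need
// nothing beyond the Writable contract.
class WritableSerializer implements Serializer<Writable> {
  public void serialize(Writable w, OutputStream out) throws IOException {
    w.write(new DataOutputStream(out));
  }
  public void deserialize(Writable w, InputStream in) throws IOException {
    w.readFields(new DataInputStream(in));
  }
  public Class<Writable> getTargetClass() {
    return Writable.class;
  }
}
```

A ThriftSerializer would follow the same shape, delegating to Thrift's own read/write machinery instead.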

I'd probably make a utility class:

{code}
class org.apache.hadoop.io.serializer.SerializerFactory extends Configured {
  // Look up the serializer registered for cls or one of its supertypes.
  <T> Serializer<T> getSerializer(Class<? extends T> cls);
}
{code}

and presumably the SerializerFactory would keep a cache from class to serializer
(hopefully with weak references to allow garbage collection). This would let you remove
all references to Writable from SequenceFile and the map/reduce classes. Any object could be
written into sequence files or passed around in map/reduce jobs. It would be cool and should
result in only a modest amount of confusion to the users. 
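A minimal sketch of that factory, assuming the Serializer interface above (the register method and WeakHashMap choice are illustrative, not a settled design):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

// Only the lookup-relevant part of the proposed interface is needed here.
interface Serializer<T> {
  Class<T> getTargetClass();
}

// Resolve a concrete class to a registered serializer by walking the
// registered serializers once, then caching the hit. WeakHashMap keys
// the cache weakly by class, so entries can be collected when a class
// (e.g. from an unloaded class loader) goes away.
class SerializerFactory {
  private final List<Serializer<?>> registered = new ArrayList<>();
  private final Map<Class<?>, Serializer<?>> cache = new WeakHashMap<>();

  void register(Serializer<?> serializer) {
    registered.add(serializer);
  }

  @SuppressWarnings("unchecked")
  <T> Serializer<T> getSerializer(Class<? extends T> cls) {
    Serializer<?> cached = cache.get(cls);
    if (cached != null) {
      return (Serializer<T>) cached;
    }
    for (Serializer<?> s : registered) {
      // First serializer whose root class is a supertype of cls wins.
      if (s.getTargetClass().isAssignableFrom(cls)) {
        cache.put(cls, s);
        return (Serializer<T>) s;
      }
    }
    throw new IllegalArgumentException("No serializer for " + cls.getName());
  }
}
```

The isAssignableFrom walk is what keeps the mapping at one serializer per root class: WritableSerializer registered against Writable handles every concrete Writable without further configuration.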

Furthermore, since it makes only relatively minor use of reflection, a C++ implementation
along similar lines should be feasible. (Although the lookup would be a lot more expensive to
evaluate, since dynamic_cast is outrageously costly given C++'s multiple-inheritance semantics.)

> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
>                 Key: HADOOP-1986
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Tom White
>             Fix For: 0.16.0
>
>         Attachments: SerializableWritable.java
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable key-value pairs.
> While it's possible to write Writable wrappers for other serialization frameworks (such as
> Thrift), this is not very convenient: it would be nicer to be able to use arbitrary types
> directly, without explicit wrapping and unwrapping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

