From: "Owen O'Malley (JIRA)"
To: hadoop-dev@lucene.apache.org
Date: Wed, 10 Oct 2007 09:42:51 -0700 (PDT)
Subject: [jira] Issue Comment Edited: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce

    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533789 ]

owen.omalley edited comment on HADOOP-1986 at 10/10/07 9:42 AM:
-----------------------------------------------------------------

*laugh* Option 1 is *precisely* what I was proposing, except that you keep blurring how this part works:

{quote}
we simply obtain the right serializer object (RecordIOSerializer or ThriftSerializer or whatever) using a factory or through a configuration file
{quote}

My proposal spells out how that happens: the configuration holds a map from root classes to the corresponding serializer classes. When the factory is given an object, it consults the map and constructs the correct serializer. Clearly the factory will cache the information so that it doesn't have to traverse the class hierarchy when constructing each serializer.

It simplifies things a bit if the serializer class itself specifies which class it works on. Instead of configuring a mapping like:

{code}
org.apache.hadoop.io.Writable->org.apache.hadoop.io.WritableSerializer
{code}

we can just provide a list of serializers, and the factory can automatically determine what each one applies to. Now that I think about it, just having:

{code}
interface Serializer<T> {
  void serialize(T t, OutputStream out) throws IOException;
  void deserialize(T t, InputStream in) throws IOException;
}
{code}

is enough, because we can use reflection to find the value of T for a given serializer class.
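For illustration, a minimal sketch of that reflection step (the class and method names here are invented for the example, and it assumes the serializer implements Serializer<T> directly with a concrete T):

{code}
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Hypothetical helper: recover the T from a class's
// "implements Serializer<T>" declaration.
class SerializerTypes {

  // Returns the class that serializerClass declares it handles, or null
  // if it does not implement Serializer<T> directly with a concrete T.
  static Class<?> findAcceptedClass(Class<?> serializerClass) {
    for (Type iface : serializerClass.getGenericInterfaces()) {
      if (iface instanceof ParameterizedType
          && ((ParameterizedType) iface).getRawType() == Serializer.class) {
        Type t = ((ParameterizedType) iface).getActualTypeArguments()[0];
        if (t instanceof Class) {
          return (Class<?>) t;
        }
      }
    }
    return null;
  }
}
{code}

This works because the type arguments a class supplies when implementing a generic interface are recorded in the class file and are visible through java.lang.reflect, even though T is erased inside method bodies.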
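Under the same assumptions, the factory side could look something like this sketch (the configured list is the hadoop.serializers value from this discussion; the caching and hierarchy walk are illustrative, and a real implementation would also memoize lookups for concrete subclasses):

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical factory: builds the root-class -> serializer-class map once
// from the configured list, then answers lookups from the cached map.
class SerializerFactory {
  private final Map<Class<?>, Class<?>> serializers =
    new HashMap<Class<?>, Class<?>>();

  // classNames would come from the hadoop.serializers configuration value.
  SerializerFactory(String[] classNames) throws ClassNotFoundException {
    for (String name : classNames) {
      Class<?> serializerClass = Class.forName(name);
      Class<?> accepted = SerializerTypes.findAcceptedClass(serializerClass);
      if (accepted != null) {
        serializers.put(accepted, serializerClass);
      }
    }
  }

  // Walk up from the object's concrete class, checking each class and its
  // directly implemented interfaces, so subclasses of a registered root
  // class (or implementors of a registered interface) also match.
  Serializer<?> getSerializer(Object obj) throws Exception {
    for (Class<?> c = obj.getClass(); c != null; c = c.getSuperclass()) {
      Class<?> match = serializers.get(c);
      if (match == null) {
        for (Class<?> iface : c.getInterfaces()) {
          match = serializers.get(iface);
          if (match != null) {
            break;
          }
        }
      }
      if (match != null) {
        return (Serializer<?>) match.newInstance();
      }
    }
    throw new IllegalArgumentException(
        "No serializer configured for " + obj.getClass().getName());
  }
}
{code}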
So, the serializers would just look like:

{code}
class ThriftSerializer implements Serializer<ThriftRecord> {
  void serialize(ThriftRecord t, OutputStream out) throws IOException {...}
  void deserialize(ThriftRecord t, InputStream in) throws IOException {...}
}

class WritableSerializer implements Serializer<Writable> {
  void serialize(Writable t, OutputStream out) throws IOException {...}
  void deserialize(Writable t, InputStream in) throws IOException {...}
}
{code}

and in the config put:

{code}
<property>
  <name>hadoop.serializers</name>
  <value>org.apache.hadoop.io.WritableSerializer,com.facebook.hadoop.ThriftSerializer</value>
  <description>The list of serializers available to Hadoop</description>
</property>
{code}


> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
>                 Key: HADOOP-1986
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.16.0
>
>         Attachments: SerializableWritable.java
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable key-value pairs. While it's possible to write Writable wrappers for other serialization frameworks (such as Thrift), this is not very convenient: it would be nicer to be able to use arbitrary types directly, without explicit wrapping and unwrapping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.