hadoop-mapreduce-issues mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (MAPREDUCE-1462) Enable context-specific and stateful serializers in MapReduce
Date Sun, 07 Feb 2010 09:20:28 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830684#action_12830684 ]

Owen O'Malley edited comment on MAPREDUCE-1462 at 2/7/10 9:19 AM:
------------------------------------------------------------------

{quote}
The changes to the serialization API are not backwards compatible, so a new package of serializer
types would need creating. Is this really necessary to achieve Avro integration?
{quote}

No, it is not necessary, but I believe it to be a much cleaner interface. The current interface
defines the metadata for serializers as a Map<String,String>. The metadata is *not*
supposed to be user-facing; it is defined by each particular serialization to control its own
serialization and deserialization. For opaque data that is not intended to be interpreted
by the user, isn't a binary blob a better way to communicate the intent than a Map<String,String>?
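
To make the contrast concrete, here is a minimal sketch, with interface and method names that are purely hypothetical (they come from neither patch), of the two ways a serialization could carry its configuration: a string map that callers fill in and interpret, versus an opaque blob that only the serialization itself reads and writes:

{code}
// Sketch only; neither interface is the actual API from MAPREDUCE-1126 or this patch.

// Style 1: configuration exposed as a string map that callers populate by hand.
interface MapConfiguredSerialization {
  boolean accept(Map<String, String> metadata);
}

// Style 2: configuration kept as an opaque blob; only the serialization
// knows how to write it and later read it back.
interface SelfDescribingSerialization {
  void writeMetadata(DataOutput out) throws IOException;
  void readMetadata(DataInput in) throws IOException;
}
{code}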

{quote}
I'm not sure why we need to serialize serializations.
{quote}

The goal is to reduce the chance that users will make a mistake in calling the API. With
your patch on 1126, all of the application's control over serialization is done indirectly
through static methods that reach into the Job's configuration and set the map. So, roughly:

1. The user calls serializer-specific configuration code, which manually sets the configuration
with the metadata.
2. To get the serializer, the framework reads the metadata back out of the configuration and looks
through the list of serializations for the first one that will accept that metadata. The selected
serializer gets the metadata and hopefully does the right thing.

I think it is much clearer and less error-prone if the framework has a method that takes
a serializer and uses that to get the metadata. The serializer serialization is really just
methods to read and write serialization-specific metadata.
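
In other words, a rough sketch of what "serializing the serialization" amounts to (the method names are hypothetical, not lifted from the attached h-1462.patch):

{code}
// Sketch only: the framework hands the serialization a stream, and the
// serialization writes whatever it needs to reconstruct itself later
// (schema, class name, flags, ...).
public abstract class Serialization {
  public abstract void serializeSelf(DataOutput out) throws IOException;
  public abstract void deserializeSelf(DataInput in) throws IOException;
}
{code}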

Under your 1126 patch, my hypothetical magic serialization looks like:

{code}
public class MyMagicSerialization extends SerializerBase {
  public static void setMapOutputKeyMagicMetadata(Job job, Other metadata) { ... }
  public static void setMapOutputValueMagicMetadata(Job job, Other metadata) { ... }
  public static void setReduceOutputKeyMagicMetadata(Job job, Other metadata) { ... }
  public static void setReduceOutputValueMagicMetadata(Job job, Other metadata) { ... }
  public boolean accept(Map<String,String> metadata) { ... }
  ...
}
{code}

The first thing to notice is that my serialization, which does *not* depend on MapReduce,
needs a bunch of methods that are very concretely tied to MapReduce. Furthermore, it is difficult
to extend by adding new contexts. If HBase needs to use this serializer, they need to add a
new method to the *serializer* for their context.
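
For example (purely illustrative, not code from any patch), each new consumer grows the serializer class by yet another context-specific setter:

{code}
// Sketch only: under the static-setter style, HBase has to modify the
// serializer class itself just to add its own context.
public class MyMagicSerialization extends SerializerBase {
  // ... the four MapReduce setters above, plus:
  public static void setHBaseCellMagicMetadata(Configuration conf, Other metadata) { ... }
}
{code}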

Now look at the equivalent in my scheme:
{code}
public class MyMagicSerialization extends Serialization {
  public void setMagicMetadata(Other metadata) { ... }
  ...
}
{code}

It does not depend on MapReduce or on the way that MapReduce may use it. On the other hand,
I do want a method to set the serializer for each context:

{code}
public class Job extends JobContext {
  enum Context {MAP_OUT_KEY, MAP_OUT_VALUE, REDUCE_OUT_KEY, REDUCE_OUT_VALUE};
  public void setSerialization(Context context, Serialization serialization);
...
}
{code}
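
A short usage sketch under this scheme (reusing the illustrative names above; setSerialization is the proposed method, the rest is hypothetical glue): the serialization is configured on its own terms and then bound to a single MapReduce context on the Job:

{code}
// Sketch only: configure the serialization, then bind it to one context.
MyMagicSerialization magic = new MyMagicSerialization();
magic.setMagicMetadata(otherMetadata);                  // serialization-specific, no MapReduce here

Job job = new Job(conf, "magic example");
job.setSerialization(Job.Context.MAP_OUT_KEY, magic);   // MapReduce-specific wiring lives on Job
{code}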

I've pulled all of the MapReduce-specific code into MapReduce's Job class. That is a much
better place for it to be. Furthermore, if MapReduce adds a new context, it only means adding
a new value to Job's enum, not adding new methods to all of the serializers.
That is a big win.

Other nice changes are:
* having serialize/deserialize methods rather than objects represents the real semantics, in
that they should not be storing state between calls (i.e. the bug that hit the original Java
serialization).
* it also means that the merge code can be given a single object that it reuses for both serialization
and deserialization, rather than one of each (see the sketch after this list).
* the new API also means that you can serialize/deserialize to a new stream without recreating
the object.
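
Here is the merge point as a sketch (the generic Serialization<MyKey>, MergeSegment, and the stream accessors are all hypothetical): a single method-based serialization is pointed at whatever stream the merger is currently working with, with no per-stream objects and no state carried between calls:

{code}
// Sketch only: one serialization instance reused across streams during a merge.
Serialization<MyKey> ser = ...;                         // single instance for the whole merge
for (MergeSegment segment : segments) {
  MyKey key = ser.deserialize(segment.input(), null);   // read from this segment's stream
  ser.serialize(key, mergedOutput);                     // write to the merged output stream
}
// One object serves both directions and all streams; nothing is cached between
// calls, which is exactly the statefulness trap that hit java.io serialization.
{code}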

As to where the serialized metadata is stored, I don't care nearly as much. It might make
sense to stick the encoded bytes into the configuration, write them into a new file, or add
them to the input split file. That matters much less to me than getting APIs that are clean,
understandable, and extensible.
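
If the configuration were the chosen store, one plausible encoding (illustrative only; the property name is made up) is to Base64 the opaque metadata bytes into a string property, since Configuration values are strings:

{code}
// Sketch only: stashing opaque serialization metadata in the job Configuration.
// Base64 is org.apache.commons.codec.binary.Base64 (commons-codec 1.4+).
byte[] metadataBytes = ...;                     // whatever the serialization wrote for itself
conf.set("mapreduce.map.output.key.serialization.metadata",      // hypothetical property name
         Base64.encodeBase64String(metadataBytes));

// Later, on the task side, the framework recovers the identical bytes:
byte[] restored = Base64.decodeBase64(
    conf.get("mapreduce.map.output.key.serialization.metadata"));
{code}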

> Enable context-specific and stateful serializers in MapReduce
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-1462
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1462
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: task
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: h-1462.patch
>
>
> Although the current serializer framework is powerful, within the context of a job it is limited to picking a single serializer for a given class. Additionally, Avro generic serialization can make use of additional configuration/state such as the schema. (Most other serialization frameworks, including Writable, Jute/Record IO, Thrift, Avro Specific, and Protocol Buffers, only need the object's class name to deserialize the object.)
> With the goal of keeping the easy things easy and maintaining backwards compatibility, we should be able to allow applications to use context-specific (e.g. map output key) serializers in addition to the current type-based ones that handle the majority of the cases. Furthermore, we should be able to support serializer-specific configuration/metadata in a type-safe manner without cluttering up the base API with a lot of new methods that will confuse new users.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

