hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-1462) Enable context-specific and stateful serializers in MapReduce
Date Fri, 05 Feb 2010 08:36:28 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Owen O'Malley updated MAPREDUCE-1462:
-------------------------------------

    Attachment: h-1462.patch

Ok this patch is a rough sketch of the way we could refactor the serialization interface.


The RootSerilizationFactory provides a factory to look up a serialization given an object's
class.

Serializations contain a serialize and deserialize method along with the Input/OutputStream
to write the object to. They also contain a serializeSelf/deserializeSelf method that serializes
the metadata for that serializer. By having the serializer handle and parse the metadata itself,
it means the framework doesn't need to support each individual serializers' information and
it doesn't lose all type safety of the string to string maps where a single mispelling can
cause an attribute to not be found or an extra space can cause parse errors.

There is a subtype of Serialization named TypedSerialization that is the base class for all
of the serializations that use the object's class as their metadata. This would include the
Writable, Thrift, ProtocolBuffers, and Avro Specific serializers.

For containers such as SequenceFile and T-File, the file would contain the name of the serializer
class and the serializer class' metadata. That is enough to reconstruct the serialization
and deserialize the objects. For writing a SequenceFile or T-File, you can specify the types
as currently and let the root serialization factory pick the serializers or you can provide
them explicitly.

In terms of how this would hook to MapReduce, the job would have the map outputs' class name
configured as currently, but it would just require that the actual type be assignable to the
declared type. (ie. you can set the map output type to Object and pass anything.) 

There would be an enumeration of the contexts and a method to set a serialization for a specific
context:

{noformat}
public enum SerializationContext {
   DEFAULT, 
   MAP_OUTPUT_KEY, 
   MAP_OUTPUT_VALUE, 
   REDUCE_OUTPUT_KEY, 
   REDUCE_OUTPUT_VALUE, 
   INPUT_SPLIT
};
{noformat}

and the Job/JobContext would get a new setter/getter for getting the Serialization for each
context. If the user doesn't specify a given context, it will use the default. If the default
isn't specified, it will use the root serialization factory for the assignable type.



> Enable context-specific and stateful serializers in MapReduce
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-1462
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1462
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: task
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: h-1462.patch
>
>
> Although the current serializer framework is powerful, within the context of a job it
is limited to picking a single serializer for a given class. Additionally, Avro generic serialization
can make use of additional configuration/state such as the schema. (Most other serialization
frameworks including Writable, Jute/Record IO, Thrift, Avro Specific, and Protocol Buffers
only need the object's class name to deserialize the object.)
> With the goal of keeping the easy things easy and maintaining backwards compatibility,
we should be able to allow applications to use context specific (eg. map output key) serializers
in addition to the current type based ones that handle the majority of the cases. Furthermore,
we should be able to support serializer specific configuration/metadata in a type safe manor
without cluttering up the base API with a lot of new methods that will confuse new users.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message