hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tom White (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator
Date Tue, 02 Feb 2010 17:54:20 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828686#action_12828686
] 

Tom White commented on MAPREDUCE-1126:
--------------------------------------

Yesterday I spoke to Owen offline about his design for this JIRA. Briefly, it works as follows.
(I apologize in advance for any errors due to misunderstandings on my part! Owen, please correct
anything I've got wrong.)

The Serialization classes change as follows. SerializationFactory becomes RootSerializationFactory,
while Serialization becomes SerializationFactory, and Serializer/Deserializer pairs are combined
into single Serialization per pair. Serialization has the ability to write itself to and read
itself from a stream, using its own serialization format. There is a subclass of Serialization
called TypedSerialization, which is subclassed by WritableSerialization. AvroGenericSerialization
would not be a TypedSerialization. RootSerializationFactory can map types into Serializations
in the usual manner (via io.serializations). Non-typed serializations, such as AvroGenericSerialization,
would be set explicitly as described below.

Instead of the metadata map, each job context (map input key, map output key, etc) has a serialized
serialization that is deserialized for the Task (by the framework) and used to carry out the
serialization for that context. The serialized serializations are not stored in the job configuration,
but rather are stored in a new file in the job directory which has the format {{(context,
serialization class, serialization bytes)+}}.

In terms of configuration for the user, the API looks like the one described by Arun and Chris.
That is, {{Job#setSerialization(MapReduceSerializationContext, Serialization)}}.

Some comments:
* The changes to the serialization API are not backwards compatible, so a new package of serializer
types would need creating.
* I'm not sure why we need to serialize serializations. The current patch avoids the need
for this by using a simple string mechanism for configuration. Having an opaque binary format
also makes it difficult to retrieve and use the serialization from other languages (e.g. C++
or other Pipes languages). The current patch is language-neutral in this regard.
* Adding a side file for the context-serializer mapping complicates the implementation. It's
not clear what container file would be used for the side file (Avro container, custom). I
understand that putting framework configuration in the job configuration may not be desirable,
but it has been done in the past so I don't know why it is being ruled out here. I would rather
have a separate effort (and discussion) to create a "private" job configuration (not accessible
by user code) for such configuration.
* The user API is no shorter than the one in the current patch. Compare:

{code}
Schema keySchema = ...
AvroGenericSerialization serialization = new AvroGenericSerialization();
serialization.setSchema(keySchema);
job.set(MAP_OUTPUT_KEY, serialization);
{code}

with

{code}
Schema keySchema = ...
AvroGenericData.setMapOutputKeySchema(job, keySchema);
{code}

I was hoping it might help reach consensus if we could incorporate some of Owen's ideas with
the existing patch. However it is not clear to me how to do this. In the meantime, I would
appreciate it if someone would review the latest patch. Thanks!


> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch,
MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class.  Instead we should use the Serialization
API to create key comparators.  This would permit, e.g., Avro-based comparators to be used,
permitting efficient sorting of complex data types without having to write a RawComparator
in Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message