Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Message-ID: <805304957.30441265133260054.JavaMail.jira@brutus.apache.org>
Date: Tue, 2 Feb 2010 17:54:20 +0000 (UTC)
From: "Tom White (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Commented: (MAPREDUCE-1126) shuffle should use serialization
 to get comparator
In-Reply-To: <367401715.1256058959495.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828686#action_12828686 ] 

Tom White commented on MAPREDUCE-1126:
--------------------------------------

Yesterday I spoke to Owen offline about his design for this JIRA. Briefly, it works as follows. (I apologize in advance for any errors due to misunderstandings on my part! Owen, please correct anything I've got wrong.)

The Serialization classes change as follows. SerializationFactory becomes RootSerializationFactory, Serialization becomes SerializationFactory, and each Serializer/Deserializer pair is combined into a single Serialization. A Serialization can write itself to and read itself from a stream, using its own serialization format. There is a subclass of Serialization called TypedSerialization, which is in turn subclassed by WritableSerialization. AvroGenericSerialization would not be a TypedSerialization. RootSerializationFactory can map types to Serializations in the usual manner (via io.serializations). Non-typed serializations, such as AvroGenericSerialization, would be set explicitly as described below.
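
To make the proposed class shapes concrete, here is a rough sketch of how I picture them. Only the class names Serialization and TypedSerialization come from the description above; every method name (serialize, deserialize, writeSelf, readSelf, setType, getType) and the toy StringSerialization class are my own assumptions for illustration, not part of Owen's design or of the current patch.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch: all method names below are assumptions.
abstract class Serialization<T> {
  // Value handling, formerly split across a Serializer/Deserializer pair.
  abstract void serialize(T value, DataOutput out) throws IOException;
  abstract T deserialize(DataInput in) throws IOException;

  // The serialization persists its own state in its own format.
  abstract void writeSelf(DataOutput out) throws IOException;
  abstract void readSelf(DataInput in) throws IOException;
}

// A serialization bound to a Java class; here its only state is that class.
abstract class TypedSerialization<T> extends Serialization<T> {
  private Class<?> type;

  void setType(Class<?> type) { this.type = type; }
  Class<?> getType() { return type; }

  @Override
  void writeSelf(DataOutput out) throws IOException {
    out.writeUTF(type.getName());
  }

  @Override
  void readSelf(DataInput in) throws IOException {
    try {
      type = Class.forName(in.readUTF());
    } catch (ClassNotFoundException e) {
      throw new IOException(e);
    }
  }
}

// Toy stand-in for a typed serialization such as WritableSerialization.
final class StringSerialization extends TypedSerialization<String> {
  StringSerialization() { setType(String.class); }

  @Override
  void serialize(String value, DataOutput out) throws IOException {
    out.writeUTF(value);
  }

  @Override
  String deserialize(DataInput in) throws IOException {
    return in.readUTF();
  }
}
```

A non-typed serialization such as AvroGenericSerialization would extend Serialization directly and write its schema, rather than a class name, in writeSelf.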

Instead of the metadata map, each job context (map input key, map output key, etc.) has a serialized serialization, which the framework deserializes for the Task and uses to carry out the serialization for that context. The serialized serializations are not stored in the job configuration, but rather in a new file in the job directory which has the format {{(context, serialization class, serialization bytes)+}}.
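
As an illustration of the {{(context, serialization class, serialization bytes)+}} layout, here is a minimal sketch of writing and reading such records. The encoding is purely my assumption (UTF strings for the context and class name, a length-prefixed byte array for the state), as are the names SerializationSideFile and Entry. The design as I understood it does not specify the container format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: class names and the wire encoding are assumptions.
final class SerializationSideFile {

  /** One (context, serialization class, serialization bytes) record. */
  static final class Entry {
    final String context;            // e.g. "MAP_OUTPUT_KEY"
    final String serializationClass; // class name of the Serialization
    final byte[] serializationBytes; // opaque state written by the Serialization

    Entry(String context, String serializationClass, byte[] serializationBytes) {
      this.context = context;
      this.serializationClass = serializationClass;
      this.serializationBytes = serializationBytes;
    }
  }

  /** Encodes the entries back to back, with no header or trailer. */
  static byte[] write(List<Entry> entries) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      for (Entry e : entries) {
        out.writeUTF(e.context);
        out.writeUTF(e.serializationClass);
        out.writeInt(e.serializationBytes.length);
        out.write(e.serializationBytes);
      }
    }
    return buffer.toByteArray();
  }

  /** Decodes records until the input is exhausted. */
  static List<Entry> read(byte[] data) throws IOException {
    List<Entry> entries = new ArrayList<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      while (in.available() > 0) {
        String context = in.readUTF();
        String serializationClass = in.readUTF();
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        entries.add(new Entry(context, serializationClass, bytes));
      }
    }
    return entries;
  }
}
```

Note that the third field is opaque to anything that cannot load the named class, which is the root of the cross-language concern raised in the comments below.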

In terms of configuration for the user, the API looks like the one described by Arun and Chris. That is, {{Job#setSerialization(MapReduceSerializationContext, Serialization)}}.

Some comments:
* The changes to the serialization API are not backwards compatible, so a new package of serializer types would need to be created.
* I'm not sure why we need to serialize serializations. The current patch avoids the need for this by using a simple string mechanism for configuration. Having an opaque binary format also makes it difficult to retrieve and use the serialization from other languages (e.g. C++ or other Pipes languages). The current patch is language-neutral in this regard.
* Adding a side file for the context-serializer mapping complicates the implementation. It's not clear what container file would be used for the side file (Avro container, custom). I understand that putting framework configuration in the job configuration may not be desirable, but it has been done in the past so I don't know why it is being ruled out here. I would rather have a separate effort (and discussion) to create a "private" job configuration (not accessible by user code) for such configuration.
* The user API is no shorter than the one in the current patch. Compare:

{code}
Schema keySchema = ...
AvroGenericSerialization serialization = new AvroGenericSerialization();
serialization.setSchema(keySchema);
job.set(MAP_OUTPUT_KEY, serialization);
{code}

with

{code}
Schema keySchema = ...
AvroGenericData.setMapOutputKeySchema(job, keySchema);
{code}

I was hoping it might help us reach consensus if we could incorporate some of Owen's ideas into the existing patch. However, it is not clear to me how to do this. In the meantime, I would appreciate it if someone would review the latest patch. Thanks!


> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class.  Instead we should use the Serialization API to create key comparators.  This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java.
