hadoop-mapreduce-issues mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator
Date Mon, 25 Jan 2010 21:35:35 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804734#action_12804734 ]

Owen O'Malley commented on MAPREDUCE-1126:

I'm very disappointed that this jira went in with a title and description that completely
misrepresented the content and scope of the patch. This patch *completely* revamps the type
system and semantics of the map/reduce framework. Changing that without a large discussion
is uncool.

I disagree with the fundamental approach taken here. The details are also problematic, but
we need to find an acceptable model before any progress on this or any related patches can
be made.

My concerns are:
  1. We should use the current global serializer factory for *all* contexts of a job. We have
7 serialized types already (map in key, map in value, map out key, map out value, reduce out
key, reduce out value, input split). We will likely end up with more types later. Having a
separate serializer and metadata for each type will be extremely confusing to the users.
  2. Defining the schema should be an Avro specific function and not part of the framework.
  3. I don't see any reason to support union types at the top level of the shuffle. There
are already libraries that handle this without changing the framework. Furthermore, an Avro
record on top of the schema is free in serialization size.
  4. Only the default comparator should come from the serializer. The user has to be able
to override it in the framework (not change the serializer factory).
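
Point 4 can be pictured with a small, self-contained Java sketch. It does not use
the Hadoop API: KeySerialization, StringSerialization, and resolve() are hypothetical
names invented for illustration. The only idea shown is the resolution order, where a
comparator set on the job wins and the serialization's comparator is the fallback.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ComparatorResolution {

    /** Hypothetical stand-in for a serialization that knows a default sort order. */
    interface KeySerialization<T> {
        Comparator<T> defaultComparator();
    }

    /** Hypothetical serialization for String keys, using natural ordering. */
    static final class StringSerialization implements KeySerialization<String> {
        @Override
        public Comparator<String> defaultComparator() {
            return Comparator.naturalOrder();
        }
    }

    /**
     * The job-level override wins when present; otherwise the
     * serialization's default comparator is used.
     */
    static <T> Comparator<T> resolve(KeySerialization<T> serialization,
                                     Optional<Comparator<T>> userOverride) {
        return userOverride.orElseGet(serialization::defaultComparator);
    }

    /** Sorts a copy of the keys with whichever comparator resolve() selects. */
    static List<String> sortKeys(List<String> keys,
                                 Optional<Comparator<String>> userOverride) {
        String[] copy = keys.toArray(new String[0]);
        Arrays.sort(copy, resolve(new StringSerialization(), userOverride));
        return Arrays.asList(copy);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("b", "a", "c");
        // No override: the serialization's default order applies.
        System.out.println(sortKeys(keys, Optional.empty()));
        // Override set on the job: the serialization itself is untouched.
        System.out.println(sortKeys(keys, Optional.of(Comparator.<String>reverseOrder())));
    }
}
```

In this sketch the override is passed next to the serialization rather than baked
into it, so changing the sort order never requires swapping the serializer factory.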

That said, I think that it is perfectly reasonable for the Avro serializer to accept all types.
So if you have a Mapper<String,String,String,String> it will use Avro serialization.
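
As a rough illustration of a serializer that accepts all types behind one global
factory, the sketch below uses simplified stand-ins rather than the real Hadoop
classes. Serialization, NativeWritable, NativeSerialization, CatchAllSerialization,
and Factory are all hypothetical names; the catch-all entry plays the role the
comment assigns to the Avro serializer. The factory asks each registered
serialization in order whether it accepts a class and returns the first match.

```java
import java.util.ArrayList;
import java.util.List;

public class GlobalSerializationLookup {

    /** Simplified stand-in for a serialization plugin; not the Hadoop interface. */
    interface Serialization {
        String name();
        boolean accept(Class<?> type);
    }

    /** Hypothetical marker for types the framework serializes natively. */
    interface NativeWritable {}

    /** Accepts only types carrying the hypothetical native marker. */
    static final class NativeSerialization implements Serialization {
        @Override public String name() { return "native"; }
        @Override public boolean accept(Class<?> type) {
            return NativeWritable.class.isAssignableFrom(type);
        }
    }

    /** Accepts every type, so it must be registered last. */
    static final class CatchAllSerialization implements Serialization {
        @Override public String name() { return "catch-all"; }
        @Override public boolean accept(Class<?> type) { return true; }
    }

    /** One factory shared by every serialized type in the job. */
    static final class Factory {
        private final List<Serialization> registered = new ArrayList<>();

        Factory register(Serialization serialization) {
            registered.add(serialization);
            return this;
        }

        /** Returns the first registered serialization that accepts the type. */
        Serialization lookup(Class<?> type) {
            for (Serialization serialization : registered) {
                if (serialization.accept(type)) {
                    return serialization;
                }
            }
            throw new IllegalArgumentException("No serialization for " + type.getName());
        }
    }

    /** Hypothetical natively serialized key type. */
    static final class Counter implements NativeWritable {}

    static Factory jobFactory() {
        return new Factory()
                .register(new NativeSerialization())
                .register(new CatchAllSerialization());
    }

    public static void main(String[] args) {
        Factory factory = jobFactory();
        System.out.println(factory.lookup(Counter.class).name()); // native
        System.out.println(factory.lookup(String.class).name());  // catch-all
    }
}
```

With this arrangement a Mapper<String,String,String,String> needs no per-type
configuration: String falls through to the catch-all entry for every context.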

> shuffle should use serialization to get comparator
> --------------------------------------------------
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch,
> MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch
>
> Currently the key comparator is defined as a Java class.  Instead we should use the Serialization
> API to create key comparators.  This would permit, e.g., Avro-based comparators to be used,
> permitting efficient sorting of complex data types without having to write a RawComparator
> in Java.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
