hadoop-mapreduce-issues mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator
Date Tue, 26 Jan 2010 17:02:34 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805095#action_12805095 ]

Doug Cutting commented on MAPREDUCE-1126:

Owen, I am sorry you did not realize earlier the magnitude of the
changes implied by this issue.  As Aaron started work on it, it became
apparent that the changes required were substantial.  A number of
people were following the issue as it progressed, and I assumed that
it had good oversight.

> 1. We should use the current global serializer factory for *all* contexts of a job.

I don't see that as an acceptable solution.  Job inputs and outputs
can reasonably differ in their serialization methods.  Specifying
seven different serializations can be done in seven lines of code that
replace the seven lines of code we currently use to specify these as
classes.  There are now more possibilities for what those seven lines
can contain, but I don't see this as a huge increase in complexity for
end users.

WritableData.setMapOutputKeyClass(job, class) is not much more
complicated than the current job.setMapOutputKeyClass(class).  The
natural way for an Avro generic data job to specify its map output key
is AvroGenericData.setMapOutputKeySchema(job, schema).  In Avro, a
schema specifies classes, binary format, and sort order.  There is in
general no single class that represents all of these aspects of a
schema.  Different serialization systems can reasonably differ in what
kind of metadata they require.
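To make the contrast concrete, here is a minimal sketch of the two
configuration styles described above.  The class and property names
are hypothetical stand-ins, not Hadoop's actual API: the job is
modeled as a plain string-keyed configuration map, roughly as JobConf
is.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a MapReduce job configuration.
class Job {
    final Map<String, String> conf = new HashMap<>();
    void set(String key, String value) { conf.put(key, value); }
    String get(String key) { return conf.get(key); }
}

// Writable-style: the map output key is identified by a class name.
class WritableData {
    static void setMapOutputKeyClass(Job job, Class<?> c) {
        job.set("mapreduce.map.output.key.class", c.getName());
    }
}

// Avro-style: the map output key is identified by a schema, which
// bundles the class mapping, binary format, and sort order.
class AvroGenericData {
    static void setMapOutputKeySchema(Job job, String schemaJson) {
        job.set("mapreduce.map.output.key.schema", schemaJson);
    }
}

public class SerializationConfigSketch {
    public static void main(String[] args) {
        Job job = new Job();
        // One line either way; only what the line contains differs.
        WritableData.setMapOutputKeyClass(job, String.class);
        AvroGenericData.setMapOutputKeySchema(job, "\"string\"");
        System.out.println(job.get("mapreduce.map.output.key.class"));
        System.out.println(job.get("mapreduce.map.output.key.schema"));
    }
}
```

Either call is a single line in the job setup; the schema-based call
simply carries richer metadata than a class name can.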

(A mapreduce simplification that might make sense long-term is to
eliminate the key/value distinction.  Each map input could be a single
object, each map output could be a single comparable object, and each
reduce output a single object.  This would eliminate three settings
per job, and I can think of no use cases where keys and values might
use different serialization systems.)

> 2. Defining the schema should be an Avro specific function.

It is Avro-specific in the current patch, so we agree on this point.

> 3. I don't see any reason to support union types at the top-level of the shuffle.

We should not force folks to define wrapper classes not required or
generated by their serialization system just to pass data through
mapreduce.  The Java class namespace is a poor mechanism to represent
mappings between all Java objects and all their possible serialized
forms.  Serialization is not completely determined by a class, and a
class does not completely determine a serialization.  Java Strings,
longs, and floats can be serialized in different ways.  A given job
might take String data from a file using one serialization system, map
it to a union type using another serialization system that provides
efficient, structured binary comparison, then write the final output
to a database as String using yet another serialization system.  Why
should we require folks to define wrapper classes to achieve this?
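For illustration only (this is not a schema from the attached patch),
an Avro union schema like the following could serve directly as a map
output key schema, with no Java wrapper class generated or defined by
the user:

```json
["string", "long", "double"]
```

The union itself carries the binary format and sort order, which is
exactly the metadata the shuffle needs.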

> 4. Only the default comparator should come from the serializer.

That would make sense if we only permit a single, global serializer.
If however the shuffle has its own serializer, then it could be done
in either place: the job could define a shuffle comparator, or it
could use the comparator from the shuffle's serializer.  In either
case, users should be able to override the comparator.  Since
comparators are a part of the serialization API, it seems better
modularization to use the comparator specified by the shuffle's
serializer, no?
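A minimal sketch of that modularization, under the assumption (the
interface and class names here are hypothetical, not Hadoop's actual
Serialization API) that a serialization supplies its own raw-byte
comparator, so the shuffle can sort serialized data without a
user-written RawComparator:

```java
import java.util.Comparator;

// Hypothetical: a serialization that knows its own binary format can
// also supply a comparator over that format.
interface RawSerialization<T> {
    byte[] serialize(T obj);
    Comparator<byte[]> getRawComparator();
}

// Example: big-endian ints with the sign bit flipped sort correctly
// under plain lexicographic byte comparison.
class IntSerialization implements RawSerialization<Integer> {
    public byte[] serialize(Integer v) {
        int x = v ^ 0x80000000;   // flip sign bit so byte order matches int order
        return new byte[] {
            (byte) (x >>> 24), (byte) (x >>> 16),
            (byte) (x >>> 8),  (byte) x
        };
    }
    public Comparator<byte[]> getRawComparator() {
        return (a, b) -> {
            for (int i = 0; i < a.length && i < b.length; i++) {
                int cmp = (a[i] & 0xff) - (b[i] & 0xff);
                if (cmp != 0) return cmp;
            }
            return a.length - b.length;
        };
    }
}

public class ShuffleComparatorSketch {
    public static void main(String[] args) {
        IntSerialization ser = new IntSerialization();
        Comparator<byte[]> cmp = ser.getRawComparator();
        // -5 sorts before 3, comparing raw bytes only.
        System.out.println(cmp.compare(ser.serialize(-5), ser.serialize(3)) < 0);
    }
}
```

The shuffle would ask the configured serialization for this comparator
by default, while still letting the job override it explicitly.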

> That said, I think it is perfectly acceptable for the Avro serializer to accept all types.

That would give the Avro serializer privileged status.  One could not also use another serializer
(e.g., a Pig, Thrift or Hessian serializer) that also accepts String.  Applications should
be able to specify which serializations they intend.

A primary design goal of the Avro project is improving the flexibility
of serialization APIs.  Mapreduce is a primary target application for
Avro.  We should not hobble Avro in Mapreduce.  The Writable model,
where classes define their serialization, has served us well, but that
model is limited.  Avro permits flexible mappings between in-memory
representations and serializations.  We can easily support this in
Mapreduce without either giving Avro privileged status or making the
Mapreduce API overly complex.  I hope you will not block this effort.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch,
MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch
> Currently the key comparator is defined as a Java class.  Instead we should use the Serialization
API to create key comparators.  This would permit, e.g., Avro-based comparators to be used,
permitting efficient sorting of complex data types without having to write a RawComparator
in Java.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
