hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator
Date Thu, 28 Jan 2010 00:20:35 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805709#action_12805709

Chris Douglas commented on MAPREDUCE-1126:

Replacing the type driven serialization with an explicitly specified, context-sensitive factory
is 1) throwing away all Java type hierarchies, 2) asserting that the serialization defines
the user types, and 3) implying that these types- and relationships between them- should remain
opaque to the MapReduce framework.

It's making a tradeoff discussed in HADOOP-6323: all the type checks are removed from the
framework, but enforced by the serializer. So {{WritableSerialization}}- appropriately- requires
an exact match for the configured class, but other serializers may not. The MapReduce framework
can't do any checks of its own- neither, notably, may Java- to verify properties of the types
users supply; their semantics are _defined by_ the serialization. For example, a job using
related {{Writable}} types may pass a compile-time type check, work with explicit Avro serialization
in the intermediate data, but fail if it were run with implicit Writable serialization.

This is a *huge* shift. It means the generic, Java types for the Mapper, Reducer, collector
etc. literally don't matter; they're effectively all {{Object}} (relying on autoboxing to
collect primitive types). This means that every serialization has its own type semantics which
need not look anything like what Java can enforce, inspect, or interpret. Given this, that
the patch puts the serialization as the most prominent interface to MapReduce is not entirely

It's also powerful functionality. By allowing any user type to be serialized/deserialized
per context, the long-term elimination of the key/value distinction doesn't change {{collect(K,V)}}
to {{collect(Object)}} as proposed, but rather {{collect(Object...)}}: the serializer transforms
the record into bytes, and the comparator works on that byte range, determining which bytes
are relevant per the serialization contract. Especially for frameworks written on top of MapReduce,
less restrictive interfaces here would surely be fertile ground for performance improvements.

That said: I hate this API for users. Someone writing a MapReduce job is writing a transform
of data; how these data are encoded in different contexts is usually irrelevant to their task.
Forcing the user to pick a serialization to declare their types to- rather than offering their
types to MapReduce- is backwards for the vast majority of cases. Consider the Writable subtype
example above: one is tying the correctness of the {{Mapper}} to the intermediate serialization
declared in the submitter code, whose semantics are inscrutable. That's just odd.

If one's map is going to emit data without a common type, then doesn't it make sense to declare
that instead of leaving the signature as {{Object}}? That is, particularly given MAPREDUCE-1411,
wouldn't the equivalent of {{Mapper<Text,Text,Text,AvroRecord>}} be a more apt signature
than {{Mapper<Text,Text,Text,Object>}} for an implementation emitting {{int}} and {{String}}
as value types?

I much prefer the semantics of the global serializer, but wouldn't object to adding an inconspicuous
knob in support of context-sensitive serialization. Would a {{Job::setSerializationFactory(CTXT,
SerializationFactory...)}} method, such that {{CTXT}} is an enumerated type of framework-hooks
(i.e. {{DEFAULT}}, {{MAP_OUTPUT_KEY}}, {{MAP_OUTPUT_VALUE}}, etc.) be satisfactory? This way,
one can instruct the framework to use/prefer a particular serialization in one context without
requiring most users to change their jobs. It also permits continued use of largely type-based
serialization which- as Tom notes- is a very common case. Writing wrappers can be irritating,
but for the MR API, I'd rather make it easier on common cases and users than on advanced uses
and framework authors.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch,
MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
> Currently the key comparator is defined as a Java class.  Instead we should use the Serialization
API to create key comparators.  This would permit, e.g., Avro-based comparators to be used,
permitting efficient sorting of complex data types without having to write a RawComparator
in Java.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message