hadoop-mapreduce-issues mailing list archives

From "Jay Booth (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator
Date Wed, 03 Feb 2010 20:55:28 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829251#action_12829251 ]

Jay Booth commented on MAPREDUCE-1126:

+1 for the general concept of a lower-level API. Great idea.

Any thoughts on explicitly setting a Mapper per split?  Joins between different formats
are a primary use case, and it's always awkward to use MultipleInputs to shoehorn the
different classes into a single conf.  As I understand it, with MultipleInputs the MapTask
wakes up, looks at its input split, compares it against a magic configuration field that maps
splits to mapper classes, and instantiates the matching mapper class.  That leads to trouble
if you need to mix it with, say, CombineFileInputFormat or anything else that relies on
configuration, since the various static setConfigValue(conf) methods each set a single value
on the assumption that there is a single mapper class.
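
To make the problem concrete, here is a minimal plain-Java sketch of the dispatch as described above. It does not use the Hadoop API: the conf is modeled as a plain Map, and the key names, the "path;mapper" encoding, and the helper methods are all hypothetical, made up for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class MagicFieldDispatch {
    // Hypothetical key names; these are NOT the real Hadoop configuration keys.
    static final String MAPPERS_KEY = "sketch.multiple.inputs.mappers";
    static final String MAX_SPLIT_KEY = "sketch.combine.max.split.size";

    /** Appends a "path;mapperClass" pair to one comma-separated field. */
    static void addInputPath(Map<String, String> conf, String path, String mapperClass) {
        String entry = path + ";" + mapperClass;
        conf.merge(MAPPERS_KEY, entry, (old, add) -> old + "," + add);
    }

    /** A static setter in the style described above: one job-wide value. */
    static void setMaxSplitSize(Map<String, String> conf, long bytes) {
        conf.put(MAX_SPLIT_KEY, Long.toString(bytes));
    }

    /** What the task does on startup: find the mapper whose path prefixes the split. */
    static String mapperFor(Map<String, String> conf, String splitPath) {
        String field = conf.getOrDefault(MAPPERS_KEY, "");
        for (String entry : field.split(",")) {
            String[] parts = entry.split(";", 2);
            if (parts.length == 2 && splitPath.startsWith(parts[0])) {
                return parts[1];
            }
        }
        throw new IllegalArgumentException("no mapper configured for " + splitPath);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        addInputPath(conf, "/logs", "LogMapper");
        addInputPath(conf, "/users", "UserMapper");
        setMaxSplitSize(conf, 64L << 20);   // meant for the /logs input
        setMaxSplitSize(conf, 256L << 20);  // meant for /users; silently replaces the first
        System.out.println(mapperFor(conf, "/users/part-00000")); // UserMapper
        System.out.println(conf.get(MAX_SPLIT_KEY));              // 268435456
    }
}
```

The mapper lookup works per split, but the second setMaxSplitSize call overwrites the first, so both inputs end up sharing one setting.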

If we set a specific mapper class per split, and then a specific config per mapper class,
I think it would be much easier for a framework author to combine different types of
functionality.  A plain user may not want to deal with the extra setup for simple jobs,
but if this is a lower-level API, maybe it could be useful?  It would certainly be cleaner
if a single-input job were just an N=1 multiple-inputs job, rather than the current situation
where a multiple-inputs job is a configuration-level hack on top of the single-input
framework.
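
A rough sketch of the shape I have in mind, again in plain Java rather than against the real Hadoop classes. InputSpec, JobSpec, and the "max.split.size" key are hypothetical names invented for this example, not a proposed API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerSplitMapperSketch {

    /** One input: a path prefix, its mapper class name, and that mapper's own settings. */
    static final class InputSpec {
        final String path;
        final String mapperClass;
        final Map<String, String> mapperConf;

        InputSpec(String path, String mapperClass, Map<String, String> mapperConf) {
            this.path = path;
            this.mapperClass = mapperClass;
            // Defensive copy: each mapper's settings stay isolated from the others.
            this.mapperConf = Collections.unmodifiableMap(new HashMap<>(mapperConf));
        }
    }

    /** A job is simply a list of inputs, each carrying its own mapper and config. */
    static final class JobSpec {
        private final List<InputSpec> inputs = new ArrayList<>();

        JobSpec addInput(String path, String mapperClass, Map<String, String> mapperConf) {
            inputs.add(new InputSpec(path, mapperClass, mapperConf));
            return this;
        }

        /** The single-input convenience is just the N=1 case of the same structure. */
        static JobSpec singleInput(String path, String mapperClass, Map<String, String> mapperConf) {
            return new JobSpec().addInput(path, mapperClass, mapperConf);
        }

        /** Returns the first input whose path is a prefix of the split's path. */
        InputSpec specFor(String splitPath) {
            for (InputSpec spec : inputs) {
                if (splitPath.startsWith(spec.path)) {
                    return spec;
                }
            }
            throw new IllegalArgumentException("no input configured for " + splitPath);
        }
    }

    public static void main(String[] args) {
        JobSpec join = new JobSpec()
            .addInput("/logs", "LogMapper", Collections.singletonMap("max.split.size", "67108864"))
            .addInput("/users", "UserMapper", Collections.singletonMap("max.split.size", "268435456"));
        InputSpec spec = join.specFor("/logs/part-00000");
        // Each input keeps its own value instead of sharing one job-wide setting.
        System.out.println(spec.mapperClass + " " + spec.mapperConf.get("max.split.size"));
    }
}
```

Because the settings travel with each input, two inputs can use the same key with different values, and a single-input job needs no separate code path.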

> shuffle should use serialization to get comparator
> --------------------------------------------------
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch,
>                      MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
> Currently the key comparator is defined as a Java class.  Instead we should use the Serialization
> API to create key comparators.  This would permit, e.g., Avro-based comparators to be used,
> permitting efficient sorting of complex data types without having to write a RawComparator
> in Java.
