hadoop-common-dev mailing list archives

From "Alejandro Abdelnur (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
Date Wed, 16 Jul 2008 07:41:32 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Alejandro Abdelnur updated HADOOP-3702:

    Attachment: patch3702.txt

Added byValue support to preserve the collector's contract of not modifying the key and value.

byReference can still be used as an optimization if the mappers and reducer in the chain do
not rely on byValue semantics.
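
The difference between the two modes can be illustrated without Hadoop. The following is a
minimal sketch, not code from the patch: {{MutableText}} is a hypothetical stand-in for a
mutable Writable such as Text, and {{upperCaseInPlace}} is a hypothetical downstream stage
that modifies its input. With byValue the upstream stage still sees its original value after
handing it over; with byReference it sees the downstream modification.

```java
import java.util.Locale;

final class ByValueSketch {
    // Hypothetical mutable value, standing in for a Writable such as Text.
    static final class MutableText {
        private String s;
        MutableText(String s) { this.s = s; }
        MutableText(MutableText other) { this.s = other.s; } // copy constructor
        void set(String s) { this.s = s; }
        String get() { return s; }
    }

    // Hypothetical downstream stage that modifies the value it receives in place.
    static void upperCaseInPlace(MutableText value) {
        value.set(value.get().toUpperCase(Locale.ROOT));
    }

    // Hands a value to the downstream stage, copying it first when byValue is true,
    // and returns what the upstream stage observes in its own object afterwards.
    static String emitAndReadBack(String text, boolean byValue) {
        MutableText value = new MutableText(text);
        MutableText handedOver = byValue ? new MutableText(value) : value;
        upperCaseInPlace(handedOver);
        return value.get();
    }

    public static void main(String[] args) {
        System.out.println(emitAndReadBack("abc", true));  // prints "abc"
        System.out.println(emitAndReadBack("abc", false)); // prints "ABC"
    }
}
```

The copy is what byValue costs, which is why byReference is attractive when no element in the
chain touches a key or value after emitting or receiving it.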

Improved the patch to allow different key/value classes to be used by different elements in
the chain. The only restriction is that the output of a mapper has to match the expected input
of the next mapper.
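
The restriction can be sketched as a check over adjacent elements. This is an illustration
under stated assumptions, not the validation done by the patch: {{Stage}} and
{{firstMismatch}} are hypothetical names, classes are compared by exact equality, the reducer
is treated as just another element, and Long and String stand in for LongWritable and Text so
that the example runs without Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChainTypeCheck {
    // Hypothetical descriptor of one chain element's declared key/value classes.
    static final class Stage {
        final String name;
        final Class<?> inKey, inValue, outKey, outValue;
        Stage(String name, Class<?> inKey, Class<?> inValue,
              Class<?> outKey, Class<?> outValue) {
            this.name = name;
            this.inKey = inKey;
            this.inValue = inValue;
            this.outKey = outKey;
            this.outValue = outValue;
        }
    }

    // Returns the name of the first element whose declared input classes differ from
    // the previous element's output classes, or null if the chain is consistent.
    static String firstMismatch(List<Stage> chain) {
        for (int i = 1; i < chain.size(); i++) {
            Stage prev = chain.get(i - 1);
            Stage cur = chain.get(i);
            if (prev.outKey != cur.inKey || prev.outValue != cur.inValue) {
                return cur.name;
            }
        }
        return null;
    }

    // Same type flow as the sample usage in this message (Long for LongWritable,
    // String for Text).
    static List<Stage> sampleChain() {
        return new ArrayList<>(Arrays.asList(
            new Stage("AMap", Long.class, String.class, String.class, String.class),
            new Stage("BMap", String.class, String.class, Long.class, String.class),
            new Stage("XReduce", Long.class, String.class, String.class, String.class),
            new Stage("CMap", String.class, String.class, Long.class, String.class),
            new Stage("DMap", Long.class, String.class, Long.class, Long.class)));
    }

    public static void main(String[] args) {
        List<Stage> chain = sampleChain();
        System.out.println(firstMismatch(chain)); // prints "null": every link matches
        // Break the chain: BMap now expects a Long key, but AMap emits String keys.
        chain.set(1, new Stage("BMap", Long.class, String.class, Long.class, String.class));
        System.out.println(firstMismatch(chain)); // prints "BMap"
    }
}
```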

Sample usage:


  JobConf conf = new JobConf();
  FileInputFormat.setInputPaths(conf, inDir);
  FileOutputFormat.setOutputPath(conf, outDir);

  // Map phase: AMap (byValue) followed by BMap (byReference).
  JobConf mapAConf = new JobConf();
  ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
                        Text.class, Text.class, true, mapAConf);
  JobConf mapBConf = new JobConf();
  ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
                        LongWritable.class, Text.class, false, mapBConf);

  // Reduce phase: XReduce (byValue), then CMap (byReference) and DMap (byValue).
  JobConf reduceConf = new JobConf();
  ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,
                          Text.class, Text.class, true, reduceConf);
  ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
                         LongWritable.class, Text.class, false, null);
  ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,
                         LongWritable.class, LongWritable.class, true, null);

  JobClient jc = new JobClient(conf);
  RunningJob job = jc.submitJob(conf);

Note: the second-to-last {{boolean}} parameter indicates whether the Mapper/Reducer being
added to the chain wants byValue (TRUE) or byReference (FALSE) semantics.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
> On the same input, we usually need to run multiple Maps one after the other with no
> Reduce in between. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map, a significant amount
> of disk I/O will be avoided.
> Similarly, all post-Reduce Maps can be chained together and run in the Reduce phase after
> the Reduce.
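
The chaining idea in the description above can be sketched in plain Java. This is an
illustration of the general technique only, not the patch's implementation: {{Step}},
{{chain}}, and the two example steps are hypothetical, and records are plain Strings. Each
step's collector passes records straight to the next step in memory, so no intermediate
output is written between steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

final class InMemoryChain {
    // Hypothetical one-to-many map step: reads one record, emits zero or more to out.
    interface Step {
        void map(String record, Consumer<String> out);
    }

    // Wires the steps back to front so that each step's collector feeds the next
    // step directly, and the last step's collector is the given sink.
    static Consumer<String> chain(List<Step> steps, Consumer<String> sink) {
        Consumer<String> next = sink;
        for (int i = steps.size() - 1; i >= 0; i--) {
            final Step step = steps.get(i);
            final Consumer<String> downstream = next;
            next = record -> step.map(record, downstream);
        }
        return next;
    }

    // Runs two example steps as one chained pass over the input.
    static List<String> run(List<String> input) {
        Step splitWords = (line, out) -> {
            for (String word : line.split(" ")) {
                out.accept(word);
            }
        };
        Step dropShort = (word, out) -> {
            if (word.length() > 2) {
                out.accept(word);
            }
        };
        List<String> result = new ArrayList<>();
        Consumer<String> head = chain(Arrays.asList(splitWords, dropShort), result::add);
        for (String line : input) {
            head.accept(line);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [quick, fox, here]
        System.out.println(run(Arrays.asList("a quick fox", "is it here")));
    }
}
```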

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
