hadoop-common-dev mailing list archives

From "Alejandro Abdelnur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
Date Mon, 07 Jul 2008 07:49:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12610905#action_12610905 ]

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

Example of creating and submitting a chain job:

{code:java}

JobConf conf = new JobConf();

// chaining maps in the Map phase

Properties mapAConf = new Properties();
mapAConf.setProperty("a", "A");
ChainMapper.addMapper(conf, AMap.class, mapAConf);

ChainMapper.addMapper(conf, BMap.class, null);

// setting the reducer

Properties reduceConf = new Properties();
ChainReducer.setReducer(conf, XReduce.class, reduceConf);

// chaining maps in the Reduce phase

ChainReducer.addMapper(conf, CMap.class, null);

ChainReducer.addMapper(conf, DMap.class, null);

...

FileInputFormat.setInputPaths(conf, inDir);
FileOutputFormat.setOutputPath(conf, outDir);

JobClient jc = new JobClient(conf);
RunningJob job = jc.submitJob(conf);

{code}

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>
> We often need to run multiple Maps one after the other on the same input, without any Reduce in between. We also need to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map, a significant amount of disk I/O is avoided.
> Similarly, all post-Reduce Maps can be chained together and run in the Reduce phase, after the Reduce.
> This could be done with ChainMapper and ChainReducer classes that manage the chain of Maps and override the OutputCollector to implement the chaining.
> The Maps and the Reduce that are part of the chain are unaware that they are executed in a chain: they receive records via the {{map}} and {{reduce}} methods and write their output via the {{OutputCollector}}.
> The API would look something like:
> {code:java}
> public class ChainMapper implements Mapper {
>   public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
>   ...
> }
> public class ChainReducer implements Reducer {
>   public static void setReducer(JobConf job, Class<? extends Reducer> klass, Properties reducerConf);
>   public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
>   ...
> }
> {code}
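
The chaining idea described above can be illustrated with a minimal, self-contained sketch in plain Java. It does not use the Hadoop API: {{Collector}}, {{SimpleMapper}} and {{chain}} are hypothetical stand-ins, and the sketch assumes that each map's collector simply forwards records to the next map, with the last map writing to the real output. The actual ChainMapper/ChainReducer implementation may well differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChainSketch {

    /** Hypothetical, simplified stand-in for an OutputCollector. */
    interface Collector<K, V> {
        void collect(K key, V value);
    }

    /** Hypothetical stand-in for a Mapper: gets one record, emits via the collector. */
    interface SimpleMapper<K, V> {
        void map(K key, V value, Collector<K, V> out);
    }

    /**
     * Returns a collector that passes every record through
     * mappers[index], mappers[index + 1], ... and finally into the sink.
     * Each mapper only sees an ordinary collector, so it cannot tell
     * whether it is writing to the real output or to the next mapper.
     */
    static <K, V> Collector<K, V> chain(List<SimpleMapper<K, V>> mappers,
                                        int index,
                                        Collector<K, V> sink) {
        if (index == mappers.size()) {
            return sink;
        }
        SimpleMapper<K, V> current = mappers.get(index);
        Collector<K, V> next = chain(mappers, index + 1, sink);
        return (key, value) -> current.map(key, value, next);
    }

    /** Runs a two-map chain (drop empty values, then upper-case) over the input. */
    static List<String> run(List<String> input) {
        SimpleMapper<String, String> dropEmpty = (k, v, out) -> {
            if (!v.isEmpty()) {
                out.collect(k, v);
            }
        };
        SimpleMapper<String, String> upper =
            (k, v, out) -> out.collect(k, v.toUpperCase());

        List<String> result = new ArrayList<>();
        Collector<String, String> head =
            chain(Arrays.asList(dropEmpty, upper), 0, (k, v) -> result.add(k + "=" + v));

        // Records flow from one map to the next in memory, with no intermediate files.
        for (int i = 0; i < input.size(); i++) {
            head.collect(String.valueOf(i), input.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "", "b")));
    }
}
```

Because records are handed from one map to the next through the collector, nothing is written to disk between the chained maps, which is where the I/O saving mentioned above would come from.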
> The {{Properties}} configuration passed for a {{Mapper}} or {{Reducer}} when it is added to the chain is injected into a copy of the job's configuration. This allows the Maps to be configured as usual, without being aware that they are in a chain.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

