hadoop-common-dev mailing list archives

From "Alejandro Abdelnur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1230) Replace parameters with context objects in Mapper, Reducer, Partitioner, InputFormat, and OutputFormat classes
Date Mon, 04 Aug 2008 08:11:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619439#action_12619439 ]

Alejandro Abdelnur commented on HADOOP-1230:
--------------------------------------------

I've played a little bit with the proposed API to see how {{MultipleOutputs}} could be integrated
in a more natural way.

I've come up with two possible alternatives (the following code samples are for the Mapper; for the Reducer it would be similar).

*Option 1:*

Define a {{MapContext}} subclass, {{MOMapContext}}, that wraps a {{MapContext}} instance, delegating all methods to it and adding its own methods for multiple-output support.

Define a {{Mapper}} subclass, {{MOMapper}}, that declares an abstract {{moMap(MOMapContext)}} method; its {{map(MapContext)}} creates a {{MOMapContext}} instance and invokes {{moMap()}}.

Whoever wants to use multiple outputs should extend the {{MOMapper}} class instead of {{Mapper}}.

The code would look like:

{code}
public abstract class MOMapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
    extends Mapper<MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>> {

  private MultipleOutputs multipleOutputs;

  public void configure(JobConf jobConf) {
    multipleOutputs = new MultipleOutputs(jobConf);
  }

  public final void map(MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> context) throws IOException {
    MOMapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> moc =
      new MOMapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>(context, multipleOutputs);
    moMap(moc);
  }

  public abstract void moMap(MOMapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> context) throws IOException;

  public void close() throws IOException {
    multipleOutputs.close();
  }

}

public class MOMapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
    extends MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {

  private MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> mapContext;
  private MultipleOutputs multipleOutputs;

  public MOMapContext(MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> mapContext,
                      MultipleOutputs multipleOutputs) {
    this.mapContext = mapContext;
    this.multipleOutputs = multipleOutputs;
  }

  //... delegates all MapContext methods to mapContext instance.

  // MO methods

  public void collect(String namedOutput, Object key, Object value) throws IOException {
    Reporter reporter = null; //TODO, how do I get a reporter ????
    multipleOutputs.getCollector(namedOutput, reporter).collect(key, value);
  }

  public void collect(String namedOutput, String multiName, Object key, Object value)
      throws IOException {
    Reporter reporter = null; //TODO, how do I get a reporter ????
    multipleOutputs.getCollector(namedOutput, multiName, reporter).collect(key, value);
  }

}
{code}
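
For illustration only, a user's mapper under *Option 1* could look something like the sketch below (the class name, the "text" named output and the tokenizing logic are made up; the {{getValue()}}/{{collect()}} calls assume the {{MapContext}} methods listed in the issue description, and the named output is assumed to be registered via the existing {{MultipleOutputs.addNamedOutput()}}):

{code}
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Sketch only: MOMapper/MOMapContext are the classes sketched above (no package defined yet);
// the class name, the "text" named output and the tokenizing are made up for illustration.
public class WordCountMOMapper extends MOMapper<LongWritable,Text,Text,IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  public void moMap(MOMapContext<LongWritable,Text,Text,IntWritable> context) throws IOException {
    for (String word : context.getValue().toString().split("\\s+")) {
      // regular output, through the standard MapContext collect()
      context.collect(new Text(word), ONE);
      // additional named output, routed through MultipleOutputs by MOMapContext
      context.collect("text", new Text(word), ONE);
    }
  }
}
{code}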

*Option 2:*

Define a {{MapContext}} subclass, {{MOMapContext}}, that extends the concrete {{MapContext}} implementation, adding methods for multiple-output support.

The concrete {{MapContext}} implementation class should be both {{Configurable}} and {{Closeable}} (following the same lifecycle as the Mapper).

The TaskRunner should look in the {{JobConf}} to determine which {{MapContext}} implementation to use (a sketch of that lookup follows).
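
A possible shape for that lookup, purely as a sketch (the {{MapContextFactory}} class, the {{mapred.map.context.class}} property name and the {{MapContextIMPL}} default are assumptions for illustration, not anything defined in this issue):

{code}
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch only: how the TaskRunner might pick the MapContext implementation from the JobConf.
public class MapContextFactory {

  @SuppressWarnings("unchecked")
  public static MapContext newMapContext(JobConf jobConf) {
    Class<? extends MapContext> contextClass =
      jobConf.getClass("mapred.map.context.class", // hypothetical property name
                       MapContextIMPL.class,       // default concrete implementation
                       MapContext.class);
    // ReflectionUtils.newInstance() also hands the configuration to Configurable instances.
    return ReflectionUtils.newInstance(contextClass, jobConf);
  }
}
{code}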

Whoever wants to use multiple outputs just defines his/her Mapper as {{extends Mapper<MOMapContext<KIN, VIN, KOUT, VOUT>>}} and defines the multiple outputs in the {{JobConf}} as usual (this would set the right {{MapContext}} implementation).

{code}
public class MOMapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
    extends MapContextIMPL<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {

  private MultipleOutputs multipleOutputs;

  public void configure(JobConf jobConf) {
    super.configure(jobConf);
    multipleOutputs = new MultipleOutputs(jobConf);
  }

  // MO methods

  public void collect(String namedOutput, Object key, Object value) throws IOException {
    Reporter reporter = null; //TODO, how do I get a reporter ????
    multipleOutputs.getCollector(namedOutput, reporter).collect(key, value);
  }

  public void collect(String namedOutput, String multiName, Object key, Object value)
      throws IOException {
    Reporter reporter = null; //TODO, how do I get a reporter ????
    multipleOutputs.getCollector(namedOutput, multiName, reporter).collect(key, value);
  }

  public void close() throws IOException {
    multipleOutputs.close();
    super.close();
  }

}
{code}
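
And, again for illustration only, usage under *Option 2* could look something like this sketch (the class name and the "text" named output are made up; {{MultipleOutputs.addNamedOutput()}} is the existing API and is assumed to also select {{MOMapContext}} as the {{MapContext}} implementation, as described above):

{code}
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// Sketch only: Mapper/MOMapContext are the proposed classes (no package defined yet);
// the class name and the "text" named output are made up for illustration.
public class SampleMOMapper extends Mapper<MOMapContext<LongWritable,Text,Text,IntWritable>> {

  private static final IntWritable ONE = new IntWritable(1);

  public void map(MOMapContext<LongWritable,Text,Text,IntWritable> context) throws IOException {
    context.collect(context.getValue(), ONE);         // default output
    context.collect("text", context.getValue(), ONE); // named output, via MOMapContext
  }

  // Registering the named output as usual; under this option that registration would
  // also have to set the MapContext implementation in the JobConf (assumption).
  public static void setupJob(JobConf conf) {
    MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
                                   Text.class, IntWritable.class);
  }
}
{code}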

IMO *Option 2* would be more natural for the Map/Reduce developer: it does not introduce a separate Map/Reduce class with a different method ({{moMap()}}) for the actual map logic, and it does not need to create a lightweight {{MOMapContext}} on every {{map()}} invocation.

In both cases I still need to figure out how to get a {{Reporter}} to pass to {{MultipleOutputs}} when getting the {{OutputCollector}}; this is required because {{MultipleOutputs}} uses counters.

Thoughts?


> Replace parameters with context objects in Mapper, Reducer, Partitioner, InputFormat, and OutputFormat classes
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1230
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1230
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: context-objs-2.patch, context-objs-3.patch, context-objs.patch
>
>
> This is a big change, but it will future-proof our API's. To maintain backwards compatibility, I'd suggest that we move over to a new package name (org.apache.hadoop.mapreduce) and deprecate the old interfaces and package. Basically, it will replace:
> package org.apache.hadoop.mapred;
> public interface Mapper extends JobConfigurable, Closeable {
>   void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException;
> }
> with:
> package org.apache.hadoop.mapreduce;
> public interface Mapper extends Closeable {
>   void map(MapContext context) throws IOException;
> }
> where MapContext has methods like getKey(), getValue(), collect(Key, Value), progress(), etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

