hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1230) Replace parameters with context objects in Mapper, Reducer, Partitioner, InputFormat, and OutputFormat classes
Date Tue, 29 Jul 2008 18:37:32 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617901#action_12617901 ]

Owen O'Malley commented on HADOOP-1230:
---------------------------------------

{quote}
1. What is the contract for cleanup()? Is it called if map()/reduce() throws an exception?
I think it should be, so Mapper/Reducer#run should call cleanup() in a finally clause.
{quote}

Currently, it is just:
{code}
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    KEYIN key = context.nextKey(null);
    VALUEIN value = null;
    while (key != null) {
      value = context.nextValue(value);
      map(key, value, context);
      key = context.nextKey(key);
    }
    cleanup(context);
  }
{code}

I thought about it, but it seemed to confuse things more than it helped. I guess it mostly
depends on whether cleanup is used to close file handles, which should still happen after an
exception, or to process the last record, which shouldn't. Of course, by overriding the run
method, the user can do either. What are other people's thoughts?
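
For reference, a finally-based variant (just a sketch of the alternative under discussion, not what the current patch does) would look roughly like:
{code}
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      KEYIN key = context.nextKey(null);
      VALUEIN value = null;
      while (key != null) {
        value = context.nextValue(value);
        map(key, value, context);
        key = context.nextKey(key);
      }
    } finally {
      // runs even if map() throws, so cleanup() can safely close file handles,
      // but it must not assume the last record was processed successfully
      cleanup(context);
    }
  }
{code}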

{quote}
2. One of the things that the previous version supported was a flexible way of handling large
value classes. If your value is huge you may not want to deserialize it into an object, but
instead read the byte stream directly. This isn't a part of this issue, but I think the current
approach will support it by i) adding streaming accessors to the context, ii) overriding the
run() method to pass in a null value, so map()/reduce() implementations get the value byte
stream from the context. (More generally, this might be the approach to support HADOOP-2429.)
Does this sound right?
{quote}

The problem that I have is that it would need to bypass the RecordReader to do it. If you
add to the context
{code}
InputStream getKey() throws IOException;
InputStream getValue() throws IOException;
{code}
you need to add a parallel method in RecordReader to get raw keys. And presumably the same
trick in the RecordWriter for output. On the other hand, a lazy value class with a file-backed
implementation could work with the object interface. Am I missing how this would work?
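
To make the lazy-value alternative concrete, here is a rough sketch of a file-backed value class (all names here are hypothetical, nothing in the patch): the record reader spills the raw value bytes to local disk, and map() streams them back on demand, so a huge value never has to be deserialized into an object.
{code}
import java.io.*;

// Hypothetical sketch: a value that keeps its bytes in a local file and
// hands them out as a stream instead of deserializing them eagerly.
public class LazyFileValue {
  private File backing;   // local file holding the serialized value bytes

  // Called by the record reader: copy the raw value bytes to local disk.
  public void fill(InputStream valueBytes, long length) throws IOException {
    backing = File.createTempFile("lazy-value", ".bin");
    OutputStream out = new FileOutputStream(backing);
    try {
      byte[] buf = new byte[64 * 1024];
      long remaining = length;
      int n;
      while (remaining > 0 &&
             (n = valueBytes.read(buf, 0, (int) Math.min(buf.length, remaining))) > 0) {
        out.write(buf, 0, n);
        remaining -= n;
      }
    } finally {
      out.close();
    }
  }

  // Called from map(): stream the bytes without materializing an object.
  public InputStream getValueStream() throws IOException {
    return new BufferedInputStream(new FileInputStream(backing));
  }
}
{code}
This keeps the RecordReader and Context signatures object-based, which is why I lean toward it over adding raw InputStream accessors everywhere.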

{quote}
3. ReduceContext could be made to implement Iterable<VALUEIN>, to make it slightly more
concise to iterate over the values (for expert use in the run method). The reduce method would
be unchanged.
{quote}

It is a pretty minor improvement of
{code}
for(VALUE v: context)
// versus
for(VALUE v: context.getValues())
{code}
and means that the ReduceContext needs an iterator() method that is relatively ambiguous between
iterating over keys or values. I think the current explicit method makes it cleaner.
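
To spell out the ambiguity, making ReduceContext itself Iterable would force the implicit iterator() to silently mean "the values for the current key" (a hypothetical sketch, not the current patch):
{code}
import java.util.Iterator;

// Hypothetical: ReduceContext implementing Iterable<VALUEIN> directly.
public abstract class ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    implements Iterable<VALUEIN> {

  // The explicit accessor: clearly the values for the current key.
  public abstract Iterable<VALUEIN> getValues();

  // The implicit form: a reader of "for (VALUEIN v : context)" can't tell
  // whether this walks the keys or the current key's values.
  public Iterator<VALUEIN> iterator() {
    return getValues().iterator();
  }
}
{code}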

{quote}
4. Although not a hard requirement, it would be nice to make the user API serialization agnostic.
I think we can make InputSplit not implement Writable, and use a SerializationFactory to serialize
splits.
{quote}

This makes sense.
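
Roughly, the framework could then write a split of any type through the serialization framework instead of requiring Writable. A sketch, where the helper class and its wiring are my assumptions and only SerializationFactory/Serializer are existing classes:
{code}
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.serializer.SerializationFactory;
import org.apache.hadoop.io.serializer.Serializer;

// Hypothetical helper: serialize a split without it implementing Writable.
public class SplitWriter {
  public static <T> void writeSplit(Configuration conf, T split,
                                    OutputStream out) throws IOException {
    SerializationFactory factory = new SerializationFactory(conf);
    @SuppressWarnings("unchecked")
    Serializer<T> serializer =
        (Serializer<T>) factory.getSerializer(split.getClass());
    serializer.open(out);     // bind the serializer to the output stream
    serializer.serialize(split);
    serializer.close();
  }
}
{code}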

{quote}
5. Is this a good opportunity to make TextInputFormat extend FileInputFormat<Text, NullWritable>,
like HADOOP-3566?
{quote}

*smile* It probably makes sense, although I'm a little hesitant to break yet another thing.

{quote}
6. JobContext#getGroupingComparator has javadoc that refers to WritableComparable, when it
should be RawComparator.
{quote}

+1

> Replace parameters with context objects in Mapper, Reducer, Partitioner, InputFormat, and OutputFormat classes
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1230
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1230
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: context-objs-2.patch, context-objs-3.patch, context-objs.patch
>
>
> This is a big change, but it will future-proof our APIs. To maintain backwards compatibility,
> I'd suggest that we move over to a new package name (org.apache.hadoop.mapreduce) and deprecate
> the old interfaces and package. Basically, it will replace:
> package org.apache.hadoop.mapred;
> public interface Mapper extends JobConfigurable, Closeable {
>   void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException;
> }
> with:
> package org.apache.hadoop.mapreduce;
> public interface Mapper extends Closeable {
>   void map(MapContext context) throws IOException;
> }
> where MapContext has methods like getKey(), getValue(), collect(Key, Value), progress(), etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

