hadoop-mapreduce-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Kimball <aa...@cloudera.com>
Subject Re: Of Configurations and Contexts
Date Fri, 12 Feb 2010 02:13:15 GMT
Filed https://issues.apache.org/jira/browse/MAPREDUCE-1486 with a concrete
test case.

On Wed, Feb 10, 2010 at 5:17 PM, Aaron Kimball <aaron@cloudera.com> wrote:

> This patch would allow the configuration to flow through the map task in a
> linear fashion:
>
> diff --git src/java/org/apache/hadoop/mapred/MapTask.java
> src/java/org/apache/hadoop/mapred/MapTask.
> index f889a47..4170922 100644
> --- src/java/org/apache/hadoop/mapred/MapTask.java
> +++ src/java/org/apache/hadoop/mapred/MapTask.java
> @@ -637,7 +637,9 @@ class MapTask extends Task {
>
>      org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY,
> OUTVALUE>
>      mapContext =
> -      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job,
> getTaskID(),
> +      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(
> +          taskContext.getConfiguration(),
> +          getTaskID(),
>            input, output,
>            committer,
>            reporter, split);
>
>
>
> On Wed, Feb 10, 2010 at 2:33 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> On Wed, Feb 10, 2010 at 2:25 PM, Todd Lipcon <todd@cloudera.com> wrote:
>> > This is, if I'm understanding Aaron correctly, the same issue that
>> > makes the mapred.input.file configuration very hard to implement in
>> > the new API.
>> >
>>
>> Sorry, I should clarify this. Obviously that particular feature can be
>> accessed by downcasting the input split to FileSplit. But this is very
>> hard to deal with when the input format wants to use a different
>> implementation class - you end up coupling the mapper to the
>> inputformat in a dirty way. Or, if you want to access the input file
>> name from the OutputFormat, I believe you're entirely out of luck
>> (though I haven't looked in trunk). In the prior API where
>> Configuration got passed through the flow nicely, it was trivial to do
>> this.
>>
>> -Todd
>>
>> > -Todd
>> >
>> > On Wed, Feb 10, 2010 at 2:16 PM, Aaron Kimball <aaron@cloudera.com>
>> wrote:
>> >> Hi folks,
>> >>
>> >> I've uncovered some behavior in Hadoop that I found surprising. I think
>> this
>> >> represents a design flaw that I'd like to see corrected.
>> >>
>> >> As we well know, decoupled components in a MapReduce job communicate
>> >> information forward through the use of Configuration instances. Every
>> >> Context (JobContext, TaskAttemptContext, MapContext, etc) carries a
>> >> Configuration object inside, accessible via getConfiguration().
>> >>
>> >> The semantics of passing data from the "configuration phase" to the
>> "run
>> >> phase" is easy; the user creates a Job on the client machine, populates
>> its
>> >> Configuration with necessary values, and all those values will be
>> visible in
>> >> the JobContext received in the map/reduce tasks themselves. Every task
>> >> expects to get the same view of the user-configured values here.
>> >>
>> >> Similarly, in my Mapper, if during the setup() method I call
>> >> context.getConfiguration().set("foo","bar"), I expect that
>> >> context.getConfiguration.get("foo") returns "bar" during the cleanup()
>> >> method. During a map task's execution, the configuration moves "forward
>> >> linearly" through time.
>> >>
>> >> The confusing part is that during the initial setup steps of the map
>> task, a
>> >> series of different configurations are used. The noteworthy section of
>> code
>> >> is MapTask.java in the runNewMapper() method (lines 607--650). A
>> JobContext
>> >> is passed in; this is immediately used as the basis for a
>> >> TaskAttemptContext. The TAC is then used to initialize the InputFormat
>> and
>> >> the RecordReader. The JobContext is then re-used to instantiate a
>> >> MapContext. The RecordReader's "initialize" method is then called with
>> this
>> >> context, ostensibly to "switch the RR over" to the MapContext. The
>> Mapper
>> >> itself is then run with the MapContext. Each of these two new Context
>> >> objects makes a deep copy of the Configuration present in JobContext.
>> >>
>> >> The problem here is that if the InputFormat sets any Configuration
>> settings,
>> >> the RecordReader will properly receive those during its construction --
>> but
>> >> the same RecordReader may be using a *different* context and thus a
>> >> *different* configuration during the actual running of the Mapper
>> itself!
>> >> LineRecordReader in particular downcasts its TaskAttemptContext to a
>> >> MapContext at some point during its lifetime, assuming that this
>> >> initialize() call has been made and that the new context is a
>> MapContext.
>> >> This is completely type-unsafe, and prevents LineRecordReader from
>> being
>> >> wrapped inside another RecordReader in all cases.
>> >>
>> >> Furthermore, other RecordReader initialize() methods do not do
>> anything;
>> >> they continue to use the Context they were created with.
>> >>
>> >> So now Configuration settings set in InputFormat.createRecordReader()
>> may or
>> >> may not be present in the Configuration accessible during
>> >> RecordReader.nextKeyValue() depending on RecordReader.initialize()'s
>> >> semantics (and that of any outer RecordReader wrapping this one!).
>> >>
>> >> This led to a pretty subtle bug in some code I was writing yesterday
>> using
>> >> CombineFileInputFormat, which requires that you wrap some RecordReader
>> >> instances in others.
>> >>
>> >> So my questions are:
>> >> * Is there a solid rationale for isolating the Configuration used in
>> these
>> >> various points in time?
>> >> * If not, is there a reason to make those deep copies of the
>> Configuration?
>> >> or can they all just share a reference to the same Configuration
>> instance?
>> >> * If we really want deep copies, can the MapContext's copy be based off
>> the
>> >> TaskAttemptContext's copy, so that we at least have a linear flow of
>> >> configuration settings through the execution of MapTask.runNewMapper()?
>> >>
>> >> I'm happy to write a patch to make these semantics more clear. As it
>> is, I
>> >> think the notion of needing to reinitialize the RecordReader with a
>> >> completely different context is error-prone. (CombineFileRecordReader,
>> for
>> >> example, in its initialize() method, does not call
>> curReader.initialize() to
>> >> initialize its child. This is a separate bug, which I'll post a patch
>> for,
>> >> but the design of the context situation makes this more problematic
>> than it
>> >> otherwise needs to be.)
>> >>
>> >> Does anyone have any input on this situation?
>> >> Thanks,
>> >> - Aaron Kimball
>> >>
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message