hadoop-mapreduce-dev mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Of Configurations and Contexts
Date Wed, 10 Feb 2010 22:33:35 GMT
On Wed, Feb 10, 2010 at 2:25 PM, Todd Lipcon <todd@cloudera.com> wrote:
> This is, if I'm understanding Aaron correctly, the same issue that
> makes the mapred.input.file configuration very hard to implement in
> the new API.
>

Sorry, I should clarify this. Obviously that particular feature can be
accessed by downcasting the input split to FileSplit. But this is very
hard to deal with when the input format wants to use a different
implementation class - you end up coupling the mapper to the
inputformat in a dirty way. Or, if you want to access the input file
name from the OutputFormat, I believe you're entirely out of luck
(though I haven't looked in trunk). In the prior API where
Configuration got passed through the flow nicely, it was trivial to do
this.
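
The coupling described above can be illustrated with a minimal Java sketch.
It deliberately does not use the Hadoop API: SketchSplit, SketchFileSplit,
SketchCombinedSplit and inputFileName() are hypothetical stand-ins invented
for this example, and the path is made up. The point is only that a mapper
which downcasts its split works until the input format switches to a
different split class.

```java
// Hypothetical stand-ins throughout; none of these are Hadoop classes.
class DowncastSketch {
    // "Mapper-side" helper that assumes every split is a file split.
    static String inputFileName(SketchSplit split) {
        // Unchecked assumption: throws ClassCastException for any other split type.
        return ((SketchFileSplit) split).getPath();
    }

    public static void main(String[] args) {
        // Works while the input format hands out file splits.
        System.out.println(inputFileName(new SketchFileSplit("/data/part-00000")));
        try {
            // Breaks as soon as the input format uses another implementation.
            inputFileName(new SketchCombinedSplit());
        } catch (ClassCastException e) {
            System.out.println("downcast failed for a non-file split");
        }
    }
}

// Stand-in for the abstract split type handed to the mapper.
interface SketchSplit {}

// Stand-in for a split that knows its file name.
class SketchFileSplit implements SketchSplit {
    private final String path;
    SketchFileSplit(String path) { this.path = path; }
    String getPath() { return path; }
}

// Stand-in for a split produced by some other input format.
class SketchCombinedSplit implements SketchSplit {}
```

A value carried in a shared configuration would avoid the cast, since the
mapper would then depend only on a key name rather than on a concrete split
class.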

-Todd

> -Todd
>
> On Wed, Feb 10, 2010 at 2:16 PM, Aaron Kimball <aaron@cloudera.com> wrote:
>> Hi folks,
>>
>> I've uncovered some behavior in Hadoop that I found surprising. I think this
>> represents a design flaw that I'd like to see corrected.
>>
>> As we well know, decoupled components in a MapReduce job communicate
>> information forward through the use of Configuration instances. Every
>> Context (JobContext, TaskAttemptContext, MapContext, etc) carries a
>> Configuration object inside, accessible via getConfiguration().
>>
>> The semantics of passing data from the "configuration phase" to the "run
>> phase" are simple: the user creates a Job on the client machine, populates its
>> Configuration with necessary values, and all those values will be visible in
>> the JobContext received in the map/reduce tasks themselves. Every task
>> expects to get the same view of the user-configured values here.
>>
>> Similarly, in my Mapper, if during the setup() method I call
>> context.getConfiguration().set("foo","bar"), I expect that
>> context.getConfiguration().get("foo") returns "bar" during the cleanup()
>> method. During a map task's execution, the configuration moves "forward
>> linearly" through time.
>>
>> The confusing part is that during the initial setup steps of the map task, a
>> series of different configurations are used. The noteworthy section of code
>> is MapTask.java in the runNewMapper() method (lines 607--650). A JobContext
>> is passed in; this is immediately used as the basis for a
>> TaskAttemptContext. The TAC is then used to initialize the InputFormat and
>> the RecordReader. The JobContext is then re-used to instantiate a
>> MapContext. The RecordReader's "initialize" method is then called with this
>> context, ostensibly to "switch the RR over" to the MapContext. The Mapper
>> itself is then run with the MapContext. Each of these two new Context
>> objects makes a deep copy of the Configuration present in JobContext.
>>
>> The problem here is that if the InputFormat sets any Configuration settings,
>> the RecordReader will properly receive those during its construction -- but
>> the same RecordReader may be using a *different* context and thus a
>> *different* configuration during the actual running of the Mapper itself!
>> LineRecordReader in particular downcasts its TaskAttemptContext to a
>> MapContext at some point during its lifetime, assuming that this
>> initialize() call has been made and that the new context is a MapContext.
>> This is completely type-unsafe, and prevents LineRecordReader from being
>> wrapped inside another RecordReader in all cases.
>>
>> Furthermore, other RecordReader initialize() methods do not do anything;
>> they continue to use the Context they were created with.
>>
>> So now Configuration settings set in InputFormat.createRecordReader() may or
>> may not be present in the Configuration accessible during
>> RecordReader.nextKeyValue() depending on RecordReader.initialize()'s
>> semantics (and that of any outer RecordReader wrapping this one!).
>>
>> This led to a pretty subtle bug in some code I was writing yesterday using
>> CombineFileInputFormat, which requires that you wrap some RecordReader
>> instances in others.
>>
>> So my questions are:
>> * Is there a solid rationale for isolating the Configuration used in these
>> various points in time?
>> * If not, is there a reason to make those deep copies of the Configuration,
>> or can they all just share a reference to the same Configuration instance?
>> * If we really want deep copies, can the MapContext's copy be based off the
>> TaskAttemptContext's copy, so that we at least have a linear flow of
>> configuration settings through the execution of MapTask.runNewMapper()?
>>
>> I'm happy to write a patch to make these semantics more clear. As it is, I
>> think the notion of needing to reinitialize the RecordReader with a
>> completely different context is error-prone. (CombineFileRecordReader, for
>> example, in its initialize() method, does not call curReader.initialize() to
>> initialize its child. This is a separate bug, which I'll post a patch for,
>> but the design of the context situation makes this more problematic than it
>> otherwise needs to be.)
>>
>> Does anyone have any input on this situation?
>> Thanks,
>> - Aaron Kimball
>>
>
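
The three alternatives in Aaron's questions (independent deep copies, one
shared instance, or a MapContext copy derived from the TaskAttemptContext's
copy) can be compared with a small Java model. This is a sketch under stated
assumptions, not the code in MapTask.java: SketchConf is a hypothetical
stand-in for Configuration (a string map with a copying constructor), and
Mode and fooSeenByMapContext() are names invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the context setup described above; not Hadoop code.
class ConfigCopySketch {
    enum Mode { INDEPENDENT_COPIES, SHARED_REFERENCE, CHAINED_COPIES }

    // Returns the value of "foo" visible through the "map context"
    // configuration after the "task attempt context" configuration set it.
    static String fooSeenByMapContext(Mode mode) {
        SketchConf jobConf = new SketchConf();

        // Configuration held by the stand-in TaskAttemptContext.
        SketchConf tacConf =
            (mode == Mode.SHARED_REFERENCE) ? jobConf : new SketchConf(jobConf);

        // The InputFormat sets a value while creating the RecordReader.
        tacConf.set("foo", "bar");

        // Configuration held by the stand-in MapContext.
        SketchConf mapConf;
        switch (mode) {
            case SHARED_REFERENCE: mapConf = jobConf; break;                 // same instance
            case CHAINED_COPIES:   mapConf = new SketchConf(tacConf); break; // copy of the TAC's copy
            default:               mapConf = new SketchConf(jobConf);        // copy of the job's original
        }
        return mapConf.get("foo");
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + fooSeenByMapContext(mode));
        }
    }
}

// Stand-in for Configuration: a string map whose copy constructor copies entries.
class SketchConf {
    private final Map<String, String> props = new HashMap<>();
    SketchConf() {}
    SketchConf(SketchConf other) { props.putAll(other.props); }
    void set(String key, String value) { props.put(key, value); }
    String get(String key) { return props.get(key); }
}
```

In this model only INDEPENDENT_COPIES loses the setting (the lookup returns
null); the shared-reference and chained-copy variants both return "bar",
which is the linear flow the questions above ask for.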
