hadoop-mapreduce-dev mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Of Configurations and Contexts
Date Wed, 10 Feb 2010 22:25:56 GMT
This is, if I'm understanding Aaron correctly, the same issue that
makes the mapred.input.file configuration very hard to implement in
the new API.


On Wed, Feb 10, 2010 at 2:16 PM, Aaron Kimball <aaron@cloudera.com> wrote:
> Hi folks,
> I've uncovered some behavior in Hadoop that I found surprising. I think this
> represents a design flaw that I'd like to see corrected.
> As we well know, decoupled components in a MapReduce job communicate
> information forward through the use of Configuration instances. Every
> Context (JobContext, TaskAttemptContext, MapContext, etc) carries a
> Configuration object inside, accessible via getConfiguration().
> The semantics of passing data from the "configuration phase" to the "run
> phase" are simple: the user creates a Job on the client machine, populates its
> Configuration with necessary values, and all those values will be visible in
> the JobContext received in the map/reduce tasks themselves. Every task
> expects to get the same view of the user-configured values here.
> Similarly, in my Mapper, if during the setup() method I call
> context.getConfiguration().set("foo","bar"), I expect that
> context.getConfiguration().get("foo") returns "bar" during the cleanup()
> method. During a map task's execution, the configuration moves "forward
> linearly" through time.
> The confusing part is that during the initial setup steps of the map task, a
> series of different configurations are used. The noteworthy section of code
> is MapTask.java in the runNewMapper() method (lines 607--650). A JobContext
> is passed in; this is immediately used as the basis for a
> TaskAttemptContext. The TAC is then used to initialize the InputFormat and
> the RecordReader. The JobContext is then re-used to instantiate a
> MapContext. The RecordReader's "initialize" method is then called with this
> context, ostensibly to "switch the RR over" to the MapContext. The Mapper
> itself is then run with the MapContext. Each of these two new Context
> objects makes a deep copy of the Configuration present in JobContext.
> The problem here is that if the InputFormat sets any Configuration settings,
> the RecordReader will properly receive those during its construction -- but
> the same RecordReader may be using a *different* context and thus a
> *different* configuration during the actual running of the Mapper itself!
> LineRecordReader in particular downcasts its TaskAttemptContext to a
> MapContext at some point during its lifetime, assuming that this
> initialize() call has been made and that the new context is a MapContext.
> This is completely type-unsafe, and prevents LineRecordReader from being
> wrapped inside another RecordReader in all cases.
> Furthermore, other RecordReader initialize() methods do not do anything;
> they continue to use the Context they were created with.
> So now Configuration settings set in InputFormat.createRecordReader() may or
> may not be present in the Configuration accessible during
> RecordReader.nextKeyValue() depending on RecordReader.initialize()'s
> semantics (and that of any outer RecordReader wrapping this one!).
> This led to a pretty subtle bug in some code I was writing yesterday using
> CombineFileInputFormat, which requires that you wrap some RecordReader
> instances in others.
> So my questions are:
> * Is there a solid rationale for isolating the Configuration used in these
> various points in time?
> * If not, is there a reason to make those deep copies of the Configuration,
> or can they all just share a reference to the same Configuration instance?
> * If we really want deep copies, can the MapContext's copy be based off the
> TaskAttemptContext's copy, so that we at least have a linear flow of
> configuration settings through the execution of MapTask.runNewMapper()?
> I'm happy to write a patch to make these semantics more clear. As it is, I
> think the notion of needing to reinitialize the RecordReader with a
> completely different context is error-prone. (CombineFileRecordReader, for
> example, in its initialize() method, does not call curReader.initialize() to
> initialize its child. This is a separate bug, which I'll post a patch for,
> but the design of the context situation makes this more problematic than it
> otherwise needs to be.)
> Does anyone have any input on this situation?
> Thanks,
> - Aaron Kimball
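
To make the copy behavior described in the quoted message concrete, here is a minimal sketch in plain Java. It does not use the Hadoop classes: SketchConf is a hypothetical stand-in for Configuration (a string map with a copying constructor), and the variable names jobConf, taskAttemptConf and mapConf only mirror the roles described above. It assumes, as the message states, that each new context copies the JobContext's configuration independently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Hadoop Configuration: a string map with a copying constructor.
class SketchConf {
    private final Map<String, String> props = new HashMap<>();

    SketchConf() {}

    // Deep copy: later writes to either object are invisible to the other.
    SketchConf(SketchConf other) {
        props.putAll(other.props);
    }

    void set(String key, String value) {
        props.put(key, value);
    }

    String get(String key) {
        return props.get(key);
    }
}

class ConfigCopyDemo {
    // Both contexts copy the job-level configuration independently,
    // so a value set on the first copy never reaches the second.
    static String independentCopies() {
        SketchConf jobConf = new SketchConf();
        SketchConf taskAttemptConf = new SketchConf(jobConf); // role: InputFormat / RecordReader creation
        taskAttemptConf.set("foo", "bar");                    // e.g. set while creating the RecordReader
        SketchConf mapConf = new SketchConf(jobConf);         // role: configuration seen while the Mapper runs
        return mapConf.get("foo");                            // null: the setting is not visible here
    }

    // Alternative raised in the third question: base the second copy on the first,
    // so settings flow forward linearly.
    static String chainedCopies() {
        SketchConf jobConf = new SketchConf();
        SketchConf taskAttemptConf = new SketchConf(jobConf);
        taskAttemptConf.set("foo", "bar");
        SketchConf mapConf = new SketchConf(taskAttemptConf); // copy of the copy
        return mapConf.get("foo");                            // "bar"
    }

    public static void main(String[] args) {
        System.out.println("independent copies: " + independentCopies());
        System.out.println("chained copies:     " + chainedCopies());
    }
}
```

Sharing a single reference instead of copying (the second question) would also make the value visible, at the cost of letting later phases mutate what earlier ones see.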
