hadoop-mapreduce-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Kimball <aa...@cloudera.com>
Subject Re: Of Configurations and Contexts
Date Thu, 11 Feb 2010 01:17:58 GMT
This patch would allow the configuration to flow through the map task in a
linear fashion:

diff --git src/java/org/apache/hadoop/mapred/MapTask.java
src/java/org/apache/hadoop/mapred/MapTask.
index f889a47..4170922 100644
--- src/java/org/apache/hadoop/mapred/MapTask.java
+++ src/java/org/apache/hadoop/mapred/MapTask.java
@@ -637,7 +637,9 @@ class MapTask extends Task {

     org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY,
OUTVALUE>
     mapContext =
-      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job,
getTaskID(),
+      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(
+          taskContext.getConfiguration(),
+          getTaskID(),
           input, output,
           committer,
           reporter, split);


On Wed, Feb 10, 2010 at 2:33 PM, Todd Lipcon <todd@cloudera.com> wrote:

> On Wed, Feb 10, 2010 at 2:25 PM, Todd Lipcon <todd@cloudera.com> wrote:
> > This is, if I'm understanding Aaron correctly, the same issue that
> > makes the mapred.input.file configuration very hard to implement in
> > the new API.
> >
>
> Sorry, I should clarify this. Obviously that particular feature can be
> accessed by downcasting the input split to FileSplit. But this is very
> hard to deal with when the input format wants to use a different
> implementation class - you end up coupling the mapper to the
> inputformat in a dirty way. Or, if you want to access the input file
> name from the OutputFormat, I believe you're entirely out of luck
> (though I haven't looked in trunk). In the prior API where
> Configuration got passed through the flow nicely, it was trivial to do
> this.
>
> -Todd
>
> > -Todd
> >
> > On Wed, Feb 10, 2010 at 2:16 PM, Aaron Kimball <aaron@cloudera.com>
> wrote:
> >> Hi folks,
> >>
> >> I've uncovered some behavior in Hadoop that I found surprising. I think
> this
> >> represents a design flaw that I'd like to see corrected.
> >>
> >> As we well know, decoupled components in a MapReduce job communicate
> >> information forward through the use of Configuration instances. Every
> >> Context (JobContext, TaskAttemptContext, MapContext, etc) carries a
> >> Configuration object inside, accessible via getConfiguration().
> >>
> >> The semantics of passing data from the "configuration phase" to the "run
> >> phase" is easy; the user creates a Job on the client machine, populates
> its
> >> Configuration with necessary values, and all those values will be
> visible in
> >> the JobContext received in the map/reduce tasks themselves. Every task
> >> expects to get the same view of the user-configured values here.
> >>
> >> Similarly, in my Mapper, if during the setup() method I call
> >> context.getConfiguration().set("foo","bar"), I expect that
> >> context.getConfiguration.get("foo") returns "bar" during the cleanup()
> >> method. During a map task's execution, the configuration moves "forward
> >> linearly" through time.
> >>
> >> The confusing part is that during the initial setup steps of the map
> task, a
> >> series of different configurations are used. The noteworthy section of
> code
> >> is MapTask.java in the runNewMapper() method (lines 607--650). A
> JobContext
> >> is passed in; this is immediately used as the basis for a
> >> TaskAttemptContext. The TAC is then used to initialize the InputFormat
> and
> >> the RecordReader. The JobContext is then re-used to instantiate a
> >> MapContext. The RecordReader's "initialize" method is then called with
> this
> >> context, ostensibly to "switch the RR over" to the MapContext. The
> Mapper
> >> itself is then run with the MapContext. Each of these two new Context
> >> objects makes a deep copy of the Configuration present in JobContext.
> >>
> >> The problem here is that if the InputFormat sets any Configuration
> settings,
> >> the RecordReader will properly receive those during its construction --
> but
> >> the same RecordReader may be using a *different* context and thus a
> >> *different* configuration during the actual running of the Mapper
> itself!
> >> LineRecordReader in particular downcasts its TaskAttemptContext to a
> >> MapContext at some point during its lifetime, assuming that this
> >> initialize() call has been made and that the new context is a
> MapContext.
> >> This is completely type-unsafe, and prevents LineRecordReader from being
> >> wrapped inside another RecordReader in all cases.
> >>
> >> Furthermore, other RecordReader initialize() methods do not do anything;
> >> they continue to use the Context they were created with.
> >>
> >> So now Configuration settings set in InputFormat.createRecordReader()
> may or
> >> may not be present in the Configuration accessible during
> >> RecordReader.nextKeyValue() depending on RecordReader.initialize()'s
> >> semantics (and that of any outer RecordReader wrapping this one!).
> >>
> >> This led to a pretty subtle bug in some code I was writing yesterday
> using
> >> CombineFileInputFormat, which requires that you wrap some RecordReader
> >> instances in others.
> >>
> >> So my questions are:
> >> * Is there a solid rationale for isolating the Configuration used in
> these
> >> various points in time?
> >> * If not, is there a reason to make those deep copies of the
> Configuration?
> >> or can they all just share a reference to the same Configuration
> instance?
> >> * If we really want deep copies, can the MapContext's copy be based off
> the
> >> TaskAttemptContext's copy, so that we at least have a linear flow of
> >> configuration settings through the execution of MapTask.runNewMapper()?
> >>
> >> I'm happy to write a patch to make these semantics more clear. As it is,
> I
> >> think the notion of needing to reinitialize the RecordReader with a
> >> completely different context is error-prone. (CombineFileRecordReader,
> for
> >> example, in its initialize() method, does not call
> curReader.initialize() to
> >> initialize its child. This is a separate bug, which I'll post a patch
> for,
> >> but the design of the context situation makes this more problematic than
> it
> >> otherwise needs to be.)
> >>
> >> Does anyone have any input on this situation?
> >> Thanks,
> >> - Aaron Kimball
> >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message