hadoop-mapreduce-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Kimball <aa...@cloudera.com>
Subject Re: Of Configurations and Contexts
Date Wed, 17 Feb 2010 00:16:06 GMT
Not really. I think it's just a program flow wiring issue. See the patch at
MAPREDUCE-1486.

On Fri, Feb 12, 2010 at 8:43 AM, Chris K Wensel <chris@wensel.net> wrote:

> Is this a side effect driven by having all the static accessors speak to
> specific Context types instead of Configuration?
>
> https://issues.apache.org/jira/browse/MAPREDUCE-851
>
> ckw
>
> On Feb 11, 2010, at 6:13 PM, Aaron Kimball wrote:
>
> > Filed https://issues.apache.org/jira/browse/MAPREDUCE-1486 with a
> concrete
> > test case.
> >
> > On Wed, Feb 10, 2010 at 5:17 PM, Aaron Kimball <aaron@cloudera.com>
> wrote:
> >
> >> This patch would allow the configuration to flow through the map task in
> a
> >> linear fashion:
> >>
> >> diff --git src/java/org/apache/hadoop/mapred/MapTask.java
> >> src/java/org/apache/hadoop/mapred/MapTask.
> >> index f889a47..4170922 100644
> >> --- src/java/org/apache/hadoop/mapred/MapTask.java
> >> +++ src/java/org/apache/hadoop/mapred/MapTask.java
> >> @@ -637,7 +637,9 @@ class MapTask extends Task {
> >>
> >>     org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY,
> >> OUTVALUE>
> >>     mapContext =
> >> -      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job,
> >> getTaskID(),
> >> +      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(
> >> +          taskContext.getConfiguration(),
> >> +          getTaskID(),
> >>           input, output,
> >>           committer,
> >>           reporter, split);
> >>
> >>
> >>
> >> On Wed, Feb 10, 2010 at 2:33 PM, Todd Lipcon <todd@cloudera.com> wrote:
> >>
> >>> On Wed, Feb 10, 2010 at 2:25 PM, Todd Lipcon <todd@cloudera.com>
> wrote:
> >>>> This is, if I'm understanding Aaron correctly, the same issue that
> >>>> makes the mapred.input.file configuration very hard to implement in
> >>>> the new API.
> >>>>
> >>>
> >>> Sorry, I should clarify this. Obviously that particular feature can be
> >>> accessed by downcasting the input split to FileSplit. But this is very
> >>> hard to deal with when the input format wants to use a different
> >>> implementation class - you end up coupling the mapper to the
> >>> inputformat in a dirty way. Or, if you want to access the input file
> >>> name from the OutputFormat, I believe you're entirely out of luck
> >>> (though I haven't looked in trunk). In the prior API where
> >>> Configuration got passed through the flow nicely, it was trivial to do
> >>> this.
> >>>
> >>> -Todd
> >>>
> >>>> -Todd
> >>>>
> >>>> On Wed, Feb 10, 2010 at 2:16 PM, Aaron Kimball <aaron@cloudera.com>
> >>> wrote:
> >>>>> Hi folks,
> >>>>>
> >>>>> I've uncovered some behavior in Hadoop that I found surprising.
I
> think
> >>> this
> >>>>> represents a design flaw that I'd like to see corrected.
> >>>>>
> >>>>> As we well know, decoupled components in a MapReduce job communicate
> >>>>> information forward through the use of Configuration instances.
Every
> >>>>> Context (JobContext, TaskAttemptContext, MapContext, etc) carries
a
> >>>>> Configuration object inside, accessible via getConfiguration().
> >>>>>
> >>>>> The semantics of passing data from the "configuration phase" to
the
> >>> "run
> >>>>> phase" is easy; the user creates a Job on the client machine,
> populates
> >>> its
> >>>>> Configuration with necessary values, and all those values will be
> >>> visible in
> >>>>> the JobContext received in the map/reduce tasks themselves. Every
> task
> >>>>> expects to get the same view of the user-configured values here.
> >>>>>
> >>>>> Similarly, in my Mapper, if during the setup() method I call
> >>>>> context.getConfiguration().set("foo","bar"), I expect that
> >>>>> context.getConfiguration.get("foo") returns "bar" during the
> cleanup()
> >>>>> method. During a map task's execution, the configuration moves
> "forward
> >>>>> linearly" through time.
> >>>>>
> >>>>> The confusing part is that during the initial setup steps of the
map
> >>> task, a
> >>>>> series of different configurations are used. The noteworthy section
> of
> >>> code
> >>>>> is MapTask.java in the runNewMapper() method (lines 607--650). A
> >>> JobContext
> >>>>> is passed in; this is immediately used as the basis for a
> >>>>> TaskAttemptContext. The TAC is then used to initialize the
> InputFormat
> >>> and
> >>>>> the RecordReader. The JobContext is then re-used to instantiate
a
> >>>>> MapContext. The RecordReader's "initialize" method is then called
> with
> >>> this
> >>>>> context, ostensibly to "switch the RR over" to the MapContext. The
> >>> Mapper
> >>>>> itself is then run with the MapContext. Each of these two new Context
> >>>>> objects makes a deep copy of the Configuration present in JobContext.
> >>>>>
> >>>>> The problem here is that if the InputFormat sets any Configuration
> >>> settings,
> >>>>> the RecordReader will properly receive those during its construction
> --
> >>> but
> >>>>> the same RecordReader may be using a *different* context and thus
a
> >>>>> *different* configuration during the actual running of the Mapper
> >>> itself!
> >>>>> LineRecordReader in particular downcasts its TaskAttemptContext
to a
> >>>>> MapContext at some point during its lifetime, assuming that this
> >>>>> initialize() call has been made and that the new context is a
> >>> MapContext.
> >>>>> This is completely type-unsafe, and prevents LineRecordReader from
> >>> being
> >>>>> wrapped inside another RecordReader in all cases.
> >>>>>
> >>>>> Furthermore, other RecordReader initialize() methods do not do
> >>> anything;
> >>>>> they continue to use the Context they were created with.
> >>>>>
> >>>>> So now Configuration settings set in InputFormat.createRecordReader()
> >>> may or
> >>>>> may not be present in the Configuration accessible during
> >>>>> RecordReader.nextKeyValue() depending on RecordReader.initialize()'s
> >>>>> semantics (and that of any outer RecordReader wrapping this one!).
> >>>>>
> >>>>> This led to a pretty subtle bug in some code I was writing yesterday
> >>> using
> >>>>> CombineFileInputFormat, which requires that you wrap some
> RecordReader
> >>>>> instances in others.
> >>>>>
> >>>>> So my questions are:
> >>>>> * Is there a solid rationale for isolating the Configuration used
in
> >>> these
> >>>>> various points in time?
> >>>>> * If not, is there a reason to make those deep copies of the
> >>> Configuration?
> >>>>> or can they all just share a reference to the same Configuration
> >>> instance?
> >>>>> * If we really want deep copies, can the MapContext's copy be based
> off
> >>> the
> >>>>> TaskAttemptContext's copy, so that we at least have a linear flow
of
> >>>>> configuration settings through the execution of
> MapTask.runNewMapper()?
> >>>>>
> >>>>> I'm happy to write a patch to make these semantics more clear. As
it
> >>> is, I
> >>>>> think the notion of needing to reinitialize the RecordReader with
a
> >>>>> completely different context is error-prone.
> (CombineFileRecordReader,
> >>> for
> >>>>> example, in its initialize() method, does not call
> >>> curReader.initialize() to
> >>>>> initialize its child. This is a separate bug, which I'll post a
patch
> >>> for,
> >>>>> but the design of the context situation makes this more problematic
> >>> than it
> >>>>> otherwise needs to be.)
> >>>>>
> >>>>> Does anyone have any input on this situation?
> >>>>> Thanks,
> >>>>> - Aaron Kimball
> >>>>>
> >>>>
> >>>
> >>
> >>
>
> --
> Chris K Wensel
> chris@concurrentinc.com
> http://www.concurrentinc.com
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message