hadoop-mapreduce-user mailing list archives

From Pavan Kulkarni <pavan.babu...@gmail.com>
Subject Re: Basic question on how reducer works
Date Mon, 09 Jul 2012 02:56:35 GMT
I had similar problems too.
I guess we should also set the debug level for
that specific class in the log4j.properties file, shouldn't we?
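
For example, something along these lines might work (the logger names are
my guess at the 1.x class locations, so please verify):

  # in conf/log4j.properties
  log4j.logger.org.apache.hadoop.mapred.MapTask=DEBUG
  log4j.logger.org.apache.hadoop.mapred.ReduceTask=DEBUG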

Also, I didn't quite get what you mean by a task's userlogs.
Where are these logs located? In the logs directory I only see
logs for all the daemons. Thanks!


On Sun, Jul 8, 2012 at 6:27 PM, Grandl Robert <rgrandl@yahoo.com> wrote:

> I see. I was looking into the TaskTracker log :).
>
> Thanks a lot,
> Robert
>
>   ------------------------------
> *From:* Harsh J <harsh@cloudera.com>
> *To:* Grandl Robert <rgrandl@yahoo.com>; mapreduce-user <
> mapreduce-user@hadoop.apache.org>
> *Sent:* Sunday, July 8, 2012 9:16 PM
>
> *Subject:* Re: Basic question on how reducer works
>
> The changes should appear in your Task's userlogs (not the TaskTracker
> logs). Have you deployed your changed code properly (i.e. do you
> generate a new tarball, or perhaps use the MRMiniCluster to do this)?
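>
> (If it helps: on a 1.x TaskTracker the per-attempt logs usually land
> under something like
> ${HADOOP_LOG_DIR}/userlogs/<attempt-id>/{stdout,stderr,syslog},
> possibly grouped under a <job-id>/ subdirectory on newer releases;
> the exact layout depends on your version.)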
>
> On Mon, Jul 9, 2012 at 4:57 AM, Grandl Robert <rgrandl@yahoo.com> wrote:
> > Hi Harsh,
> >
> > Your comments were extremely helpful.
> >
> > Still I am wondering why, if I add LOG.info entries into MapTask.java or
> > ReduceTask.java in most of the functions (including Old/NewOutputCollector),
> > the logs are not shown. This makes it hard for me to track which
> > functions are called and which are not, even more so in ReduceTask.java.
> >
> > Do you have any ideas ?
> >
> > Thanks a lot for your answer,
> > Robert
> >
> > ________________________________
> > From: Harsh J <harsh@cloudera.com>
> > To: mapreduce-user@hadoop.apache.org; Grandl Robert <rgrandl@yahoo.com>
> > Sent: Sunday, July 8, 2012 1:34 AM
> >
> > Subject: Re: Basic question on how reducer works
> >
> > Hi Robert,
> >
> > Inline. (Answer is specific to Hadoop 1.x since you asked for that
> > alone, but certain things may vary for Hadoop 2.x).
> >
> > On Sun, Jul 8, 2012 at 7:07 AM, Grandl Robert <rgrandl@yahoo.com> wrote:
> >> Hi,
> >>
> >> I have some questions related to basic functionality in Hadoop.
> >>
> >> 1. When a Mapper processes the intermediate output data, how does it
> >> know how many partitions to create (i.e. how many reducers there will
> >> be), and how much data should go into each partition for each reducer?
> >
> > The number of reducers is not dynamic; it is user-specified and set in
> > the job configuration. Hence the Partitioner knows the value to use for
> > its numPartitions (== numReduces for the job).
> >
> > For this one in 1.x code, look at MapTask.java, in the constructors of
> > internal classes OldOutputCollector (Stable API) and
> > NewOutputCollector (New API).
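> >
> > As a rough illustration (a sketch of the stable-API contract, not the
> > actual Hadoop source), the reduce count flows from the job conf into
> > the partitioner like this:
> >
> >   // Job setup: fixes the number of partitions up front.
> >   conf.setNumReduceTasks(10);
> >
> >   // Partitioner side: numPartitions arrives as an argument; e.g. the
> >   // HashPartitioner-style computation.
> >   public int getPartition(K key, V value, int numPartitions) {
> >     return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
> >   }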
> >
> > The data estimated to go into a partition, used for limit/scheduling
> > checks, is currently a naive computation done by summing the estimated
> > output sizes of each map. See
> > ResourceEstimator#getEstimatedReduceInputSize for the overall
> > estimation across maps, and Task#calculateOutputSize for the
> > per-map estimation code.
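> >
> > In spirit the estimate is just a sum, something like this (method
> > names hypothetical, for illustration only):
> >
> >   // Sum each completed map's estimated output size, then treat the
> >   // total (split evenly across partitions) as the reduce input size.
> >   long total = 0;
> >   for (TaskInProgress map : completedMaps) {
> >     total += map.getEstimatedOutputSize();
> >   }
> >   long perReducer = total / numReduceTasks;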
> >
> >> 2. When a JobTracker assigns a task to a reducer, does it also specify
> >> the locations of the intermediate output data it should retrieve?
> >> And how will a reducer know what portion it has to retrieve from each
> >> remote location holding intermediate output?
> >
> > The JT does not send the location information when a reduce is
> > scheduled. When the reducers begin their shuffle phase, they query the
> > TaskTracker for the map completion events, via the
> > TaskTracker#getMapCompletionEvents protocol call. The TaskTracker by
> > itself calls the JobTracker#getTaskCompletionEvents protocol call to
> > get this info underneath. The returned structure carries the host that
> > has completed the map successfully, which the reduce's copier relies
> > on to fetch the data from the right host's TT.
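> >
> > Loosely, the reduce-side polling looks like this (signatures
> > simplified from the 1.x umbilical protocol):
> >
> >   // Ask the local TT for newly completed maps, then remember which
> >   // host serves each map's output.
> >   MapTaskCompletionEventsUpdate update =
> >       umbilical.getMapCompletionEvents(jobId, fromEventId, maxEvents,
> >                                        reduceId);
> >   for (TaskCompletionEvent e : update.getMapTaskCompletionEvents()) {
> >     if (e.getTaskStatus() == TaskCompletionEvent.Status.SUCCEEDED) {
> >       scheduleFetch(e.getTaskTrackerHttp()); // TT HTTP address
> >     }
> >   }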
> >
> > The reduce merely asks each TT for the data assigned to it from the
> > specific completed maps. Note that a reduce task's ID is also its
> > partition ID, so it simply asks for the data for its own task ID #,
> > and the TT serves the right parts of the intermediate data to it
> > over HTTP.
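> >
> > Concretely, the fetch is an HTTP GET against the TT's shuffle servlet;
> > from memory the 1.x URL looks roughly like the following, so do
> > double-check against your version:
> >
> >   http://<tt-host>:<http-port>/mapOutput?job=<job-id>&map=<map-attempt-id>&reduce=<partition>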
> >
> > Feel free to ping back if you need some more clarification! :)
> >
> > --
> > Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
>
>


-- 

--With Regards
Pavan Kulkarni
