hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Basic question on how reducer works
Date Mon, 09 Jul 2012 03:38:44 GMT
Pavan,

This is covered in the MR tutorial doc:
http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#Task+Logs
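
In 1.x the per-task-attempt logs (stdout, stderr and the log4j syslog)
land on the node that ran the attempt, under the TaskTracker's log
directory; the exact layout varies a little across releases, but it is
along the lines of:

    ${HADOOP_LOG_DIR}/userlogs/<job-id>/<task-attempt-id>/stdout
    ${HADOOP_LOG_DIR}/userlogs/<job-id>/<task-attempt-id>/stderr
    ${HADOOP_LOG_DIR}/userlogs/<job-id>/<task-attempt-id>/syslog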

On Mon, Jul 9, 2012 at 8:26 AM, Pavan Kulkarni <pavan.baburao@gmail.com> wrote:
> I too had similar problems.
> I guess we should also set the debug level for
> that specific class in the log4j.properties file, shouldn't we?
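> For example, something like this in conf/log4j.properties (just a
> guess on my side, untested):
>
>     log4j.logger.org.apache.hadoop.mapred.MapTask=DEBUG
>     log4j.logger.org.apache.hadoop.mapred.ReduceTask=DEBUG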
>
> And I didn't quite get what you meant by the task's userlogs.
> Where are these logs located? In the logs directory I only see
> logs for all the daemons. Thanks
>
>
> On Sun, Jul 8, 2012 at 6:27 PM, Grandl Robert <rgrandl@yahoo.com> wrote:
>>
>> I see. I was looking into tasktracker log :).
>>
>> Thanks a lot,
>> Robert
>>
>> ________________________________
>> From: Harsh J <harsh@cloudera.com>
>> To: Grandl Robert <rgrandl@yahoo.com>; mapreduce-user
>> <mapreduce-user@hadoop.apache.org>
>> Sent: Sunday, July 8, 2012 9:16 PM
>>
>> Subject: Re: Basic question on how reducer works
>>
>> The changes should appear in your Task's userlogs (not the TaskTracker
>> logs). Have you deployed your changed code properly (i.e. do you
>> generate a new tarball, or perhaps use MiniMRCluster to do this)?
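>> For example, an untested sketch using MiniMRCluster from a JUnit test
>> (with the Hadoop test jar on the classpath), so that whatever
>> MapTask/ReduceTask you just compiled is what actually runs:
>>
>>     // Spin up an in-process MR cluster over the local filesystem:
>>     // 2 TaskTrackers, fs URI "file:///", 1 local dir per TT.
>>     MiniMRCluster mr = new MiniMRCluster(2, "file:///", 1);
>>     JobConf conf = mr.createJobConf(); // already wired to the mini JT
>>     // ... configure and submit your job with this conf ...
>>     mr.shutdown();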
>>
>> On Mon, Jul 9, 2012 at 4:57 AM, Grandl Robert <rgrandl@yahoo.com> wrote:
>> > Hi Harsh,
>> >
>> > Your comments were extremely helpful.
>> >
>> > Still, I am wondering why, if I add LOG.info entries into MapTask.java
>> > or ReduceTask.java in most of the functions (including
>> > Old/NewOutputCollector), the logs are not shown. This makes it hard
>> > for me to track which functions are called and which are not, even
>> > more so in ReduceTask.java.
>> >
>> > Do you have any ideas ?
>> >
>> > Thanks a lot for your answer,
>> > Robert
>> >
>> > ________________________________
>> > From: Harsh J <harsh@cloudera.com>
>> > To: mapreduce-user@hadoop.apache.org; Grandl Robert <rgrandl@yahoo.com>
>> > Sent: Sunday, July 8, 2012 1:34 AM
>> >
>> > Subject: Re: Basic question on how reducer works
>> >
>> > Hi Robert,
>> >
>> > Inline. (Answer is specific to Hadoop 1.x since you asked for that
>> > alone, but certain things may vary for Hadoop 2.x).
>> >
>> > On Sun, Jul 8, 2012 at 7:07 AM, Grandl Robert <rgrandl@yahoo.com> wrote:
>> >> Hi,
>> >>
>> >> I have some questions related to basic functionality in Hadoop.
>> >>
>> >> 1. When a Mapper processes the intermediate output data, how does it
>> >> know how many partitions to create (i.e., how many reducers there
>> >> will be), and how much data should go into each partition for each
>> >> reducer?
>> >
>> > The number of reducers is not dynamic; it is user-specified and set
>> > in the job configuration. Hence the Partitioner knows the value it
>> > needs to use for its numPartitions (== numReduces for the job).
>> >
>> > For this one in 1.x code, look at MapTask.java, in the constructors of
>> > internal classes OldOutputCollector (Stable API) and
>> > NewOutputCollector (New API).
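>> >
>> > To make that concrete, the default HashPartitioner boils down to the
>> > following (a sketch against the stable API; numPartitions arrives as
>> > the job's configured reduce count, i.e. JobConf#setNumReduceTasks /
>> > mapred.reduce.tasks):
>> >
>> >     import org.apache.hadoop.mapred.JobConf;
>> >     import org.apache.hadoop.mapred.Partitioner;
>> >
>> >     public class HashLikePartitioner<K, V> implements Partitioner<K, V> {
>> >       public void configure(JobConf job) { }
>> >
>> >       public int getPartition(K key, V value, int numPartitions) {
>> >         // Mask the sign bit so the modulo result is never negative.
>> >         return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
>> >       }
>> >     }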
>> >
>> > The amount of data estimated to go into a partition, used for
>> > limit/scheduling checks, is currently a naive computation done by
>> > summing the estimated output sizes of each map. See
>> > ResourceEstimator#getEstimatedReduceInputSize for the overall
>> > estimation across maps, and Task#calculateOutputSize for the per-map
>> > estimation code.
>> >
>> >> 2. When the JobTracker assigns a task to a reducer, does it also
>> >> specify the locations of the intermediate output data the reducer
>> >> should retrieve? And how does a reducer know, for each remote
>> >> location holding intermediate output, which portion it alone has to
>> >> retrieve?
>> >
>> > The JT does not send location information when a reduce is
>> > scheduled. When the reducers begin their shuffle phase, they query
>> > the TaskTracker for map completion events, via the
>> > TaskTracker#getMapCompletionEvents protocol call. The TaskTracker
>> > itself calls the JobTracker#getTaskCompletionEvents protocol call
>> > underneath to get this info. The returned structure carries the host
>> > that completed the map successfully, which the Reduce's copier relies
>> > on to fetch the data from the right host's TT.
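>> >
>> > As a toy sketch of that polling loop (hypothetical names here, not
>> > the real ReduceTask/TaskUmbilicalProtocol signatures):
>> >
>> >     // Sketch only; ignores InterruptedException and error handling.
>> >     int fromEventId = 0;
>> >     Map<String, String> hostByMap = new HashMap<String, String>();
>> >     while (hostByMap.size() < totalMaps) {
>> >       // The TT answers from its cache of JT task-completion events.
>> >       for (CompletionEvent e : tt.completionEventsSince(jobId, fromEventId)) {
>> >         fromEventId++;
>> >         if (e.succeeded()) {
>> >           hostByMap.put(e.mapAttemptId(), e.host()); // whom to fetch from
>> >         }
>> >       }
>> >       Thread.sleep(1000); // back off before asking again
>> >     }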
>> >
>> > The reduce merely asks each TT for the data assigned to it from the
>> > specific completed maps. Note that a reduce's task ID is also its
>> > partition ID, so it only has to ask for the data belonging to its own
>> > task ID, and the TT serves the right parts of the intermediate data
>> > to it over HTTP.
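>> >
>> > Concretely, each fetch is a plain HTTP GET against the serving TT's
>> > MapOutputServlet, along these lines (1.x layout):
>> >
>> >     http://<tt-host>:<tt-http-port>/mapOutput?job=<job-id>&map=<map-attempt-id>&reduce=<partition>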
>> >
>> > Feel free to ping back if you need some more clarification! :)
>> >
>> > --
>> > Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>>
>
>
>
> --
> With Regards,
> Pavan Kulkarni
>



-- 
Harsh J
