Date: Tue, 11 Sep 2012 15:07:45 +0900
From: Elaine Gan <elaine-gan@gmo.jp>
To: user@hadoop.apache.org
Subject: Re: Understanding of the hadoop distribution system (tuning)
In-Reply-To: <20120911105610.2C8B.13FE4A9A@gmo.jp>
Message-Id: <20120911150744.2C90.13FE4A9A@gmo.jp>

Hi Hermanth,

Thank you for your detailed answers.
Your answers helped me a lot in understanding, especially on the Job UI.

Sorry, I missed out my specs:
NameNode (JobTracker)  : CPU x 4
DataNode (TaskTracker) : CPU x 4

I am replying inline too.

> > I have data of around 518 MB, and I wrote an MR program to process it.
> > Here are some of my settings in my mapred-site.xml:
> > ---------------------------------------------------------------
> > mapred.tasktracker.map.tasks.maximum = 20
> > mapred.tasktracker.reduce.tasks.maximum = 20
> > ---------------------------------------------------------------
>
> These two configurations essentially tell the tasktrackers that they can
> run 20 maps and 20 reduces in parallel on a machine. Is this what you
> intended? (Generally the sum of these two values should equal the number
> of cores on your tasktracker node, or a little more.)
>
> Also, it would help if you can tell us your cluster size - i.e. the
> number of slaves.

Cluster size (number of slaves) = 4

Yes, I meant that the maximum number of tasks that can run on ONE machine
is 20, for both map and reduce.

> > My block size is the default, 64 MB.
> > With my data size = 518 MB, I guess setting the maximum for MR tasks
> > to 20 is far more than enough (518 / 64 = 8) - did I get that right?
>
> I suppose what you want is to run all the maps in parallel. For that, the
> number of map slots in your cluster should be more than the number of
> maps of your job (assuming there's a single job running). If the number
> of slots is less than the number of maps, the maps will be scheduled in
> multiple waves. On your jobtracker main page, Cluster Summary > Map Task
> Capacity gives you the total slots available in your cluster.

My Map Task Capacity = 80

So, from the explanation and from my data size and configuration:

Data size = 518 MB
Number of map tasks required = 518 / 64 = 8 tasks

These 8 tasks should be spread among 4 slaves, which means each node
should be able to handle at least 2 tasks.
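As a sanity check on the arithmetic above, here is a small sketch (using the figures from this thread; note that 518 / 64 leaves a 6 MB remainder, so the framework would actually create 9 splits rather than 8, with the last split smaller than a block):

```python
import math

# Figures taken from this thread (assumptions, not measured values).
data_size_mb = 518        # input data size
block_size_mb = 64        # dfs.block.size (default 64 MB)
slaves = 4                # number of tasktracker nodes
map_slots_per_node = 20   # mapred.tasktracker.map.tasks.maximum

# One map task per input split; the 6 MB remainder becomes its own split.
num_map_tasks = math.ceil(data_size_mb / block_size_mb)

# Total map slots across the cluster (what the Job UI calls Map Task Capacity).
total_map_slots = slaves * map_slots_per_node

# If tasks exceed slots, maps run in multiple waves; here one wave suffices.
waves = math.ceil(num_map_tasks / total_map_slots)

print(num_map_tasks, total_map_slots, waves)  # -> 9 80 1
```

With 80 slots against 9 map tasks, every map can indeed run in the first wave, so the slot maximum of 20 per node is not the bottleneck here.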
And my setting was mapred.tasktracker.map.tasks.maximum = 20, which is
more than enough - so does that mean the approach is correct? (Well, I
have CPU x 4 in each machine, so in the case of large data I should divide
by 4 to determine the smallest figure for
mapred.tasktracker.map.tasks.maximum.)

> > When I run the MR program, I can see in the Map/Reduce Administration
> > page that the number of Maps Total = 8, so I assume that everything is
> > going well here - once again, if I'm wrong please correct me.
> > (Sometimes it shows only Maps Total = 3.)
>
> This value tells us the number of maps that will run for the job.

OK.

> > There's one thing I'm uncertain about in Hadoop's distribution.
> > Does Maps Total = 8 mean that there are 8 map tasks split among all
> > the data nodes (tasktrackers)?
> > Is there any way I can check whether the tasks are shared among
> > datanodes (where the tasktrackers are working)?
>
> There's no easy way to check this. The task page for every task shows
> the attempts that ran for each task, and where they ran, under the
> 'Machine' column.

Thank you. I see that they're processed on different "Machine" entries, so
I guess it's working correctly :)

> > When I clicked on each link under the Task Id, I can see "Input Split
> > Locations" stated under each task's details. If the inputs are split
> > between data nodes, does that mean that everything is working well?
>
> I think this is just the location of the splits, including the replicas.
> What you could check is whether enough data-local maps ran - which means
> that the tasks mostly got their inputs from datanodes running on the
> same machine as themselves. This is given by the counter "Data-local map
> tasks" on the job UI page.

There are two cases under the Job UI:

           Counter                 Map   Reduce   Total
           -------------------------------------------
Case (1)   Launched map tasks       0      0       4
           Data-local map tasks     0      0       4
Case (2)   Launched map tasks       0      0       2
           Data-local map tasks     0      0       1

Hmm.. I don't quite understand this. In case (2), does it mean the two map
tasks are actually reading data from the same datanode?

But anyway, is this monitoring needed for tuning performance?

Thank you.
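For what it's worth, a small sketch of how those two counters can be read together - `data_local_fraction` is a made-up helper name, not part of Hadoop; it just computes the locality ratio the counters above let you estimate:

```python
def data_local_fraction(launched_maps: int, data_local_maps: int) -> float:
    """Fraction of launched map tasks that read their input split from a
    datanode on the same machine (i.e. were data-local)."""
    if launched_maps == 0:
        return 0.0
    return data_local_maps / launched_maps

# Case (1) from the thread: all 4 launched maps were data-local.
print(data_local_fraction(4, 4))  # -> 1.0

# Case (2): only 1 of the 2 launched maps was data-local; the other map
# had to pull its input over the network from a remote datanode.
print(data_local_fraction(2, 1))  # -> 0.5
```

So case (2) would mean one map was not data-local (it read its block over the network), rather than two maps sharing one datanode. A persistently low ratio is worth monitoring for tuning, since non-local reads add network transfer time to every affected map.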