hadoop-hdfs-user mailing list archives

From Elaine Gan <elaine-...@gmo.jp>
Subject Re: Understanding of the hadoop distribution system (tuning)
Date Tue, 11 Sep 2012 06:07:45 GMT
Hi Hemanth,

Thank you for your detailed answers. They helped me a lot in
understanding, especially regarding the Job UI.

Sorry, I left my specs out earlier:
NameNode (JobTracker) : CPUx4
DataNode (TaskTracker) : CPUx4

I am replying inline too.

> > I have a data of around 518MB, and i wrote a MR program to process it.
> > Here are some of my settings in my mapred-site.xml.
> > ---------------------------------------------------------------
> > mapred.tasktracker.map.tasks.maximum = 20
> > mapred.tasktracker.reduce.tasks.maximum = 20
> > ---------------------------------------------------------------
> >
> These two configurations essentially tell the tasktrackers that they can
> run 20 maps and 20 reduces in parallel on a machine. Is this what you
> intended ? (Generally the sum of these two values should equal the number
> of cores on your tasktracker node, or a little more).
> Also, would help if you can tell us your cluster size - i.e. number of
> slaves.

Cluster size (No of slaves) = 4

Yes, I meant that the maximum number of tasks that can run on a single
machine is 20, for both map and reduce.
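
For reference, those two settings live in mapred-site.xml; a minimal fragment with the values from this thread would look like:

```xml
<configuration>
  <!-- Max map tasks one tasktracker runs in parallel -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>20</value>
  </property>
  <!-- Max reduce tasks one tasktracker runs in parallel -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>20</value>
  </property>
</configuration>
```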

> > My block size is default, 64MB
> > With my data size = 518MB, i guess setting the maximum for MR task to 20
> > is far more than enough (518/64 = 8) , did i get it correctly?
> >
> >
> I suppose what you want is to run all the maps in parallel. For that, the
> number of map slots in your cluster should be more than the number of maps
> of your job (assuming there's a single job running). If the number of slots
> is less than number of maps, the maps would be scheduled in multiple waves.
> On your jobtracker main page, the Cluster Summary > Map Task Capacity gives
> you the total slots available in your cluster.

My Map Task Capacity = 80 (4 slaves x 20 map slots per tasktracker).
So, from your explanation and from my data size and configuration:
Data size = 518MB
Number of map tasks required = 518/64 = 8 tasks
These 8 tasks are spread among the 4 slaves, so each node only needs to
handle about 2 map tasks.
And my setting was mapred.tasktracker.map.tasks.maximum = 20, which is
more than enough, so the approach is correct?
(Well, I have CPUx4 on each machine, so for larger data I should divide
the number of map tasks by my 4 nodes to determine the smallest
reasonable figure for mapred.tasktracker.map.tasks.maximum.)
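
As a rough sanity check, the split count and the number of map "waves" can be sketched like this (a simplified model of FileInputFormat's split logic, not the exact Hadoop code; the 1.1 slop factor lets the last split be slightly larger than a block, which is why 518MB yields 8 maps rather than 9):

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB   # default block size, as in this thread
SPLIT_SLOP = 1.1       # FileInputFormat lets the last split be up to 10% oversized

def num_splits(file_bytes, block=BLOCK_SIZE):
    """Simplified sketch of FileInputFormat.getSplits() for a single file."""
    splits, remaining = 0, file_bytes
    while remaining / block > SPLIT_SLOP:
        splits += 1
        remaining -= block
    if remaining > 0:
        splits += 1        # the final, possibly oversized, split
    return splits

maps = num_splits(518 * MB)   # 8 maps for a 518 MB file
slots = 4 * 20                # 4 slaves x 20 map slots each = 80 slots
waves = -(-maps // slots)     # ceiling division: 1 wave, all maps in parallel
print(maps, slots, waves)
```

Since the cluster has far more map slots (80) than maps (8), all maps run in a single wave.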

> > When i run the MR program, i could see in the Map/Reduce Administration
> > page that the number of Maps Total = 8, so i assume that everything is
> > going well here, once again if i'm wrong please correct me.
> > (Sometimes it shows only Maps Total = 3)
> >
> This value tells us the number of maps that will run for the job.


> > There's one thing which i'm uncertain about hadoop distribution.
> > Is the Maps Total = 8 means that there are 8 map tasks split among all
> > the data nodes (task trackers)?
> > Is there anyway i can checked whether all the tasks are shared among
> > datanodes (where task trackers are working).
> >
> There's no easy way to check this. The task page for every task shows the
> attempts that ran for each task and where they ran under the 'Machine'
> column.

Thank you. I see that they were processed on different machines (under
the "Machine" column), so I guess it's working correctly :)

> > When i clicked on each link under that Task Id, i can see there's "Input
> > Split Locations" stated under each task details, if the inputs are
> > splitted between data nodes, does that means that everything is working
> > well?
> >
> >
> I think this is just the location of the splits, including the replicas.
> What you could see is if enough data local maps ran - which means that the
> tasks mostly got their inputs from datanodes running on the same machine as
> themselves. This is given by the counter "Data-local map tasks" on the job
> UI page.
There are two cases under the Job UI:

Counter                 Map  Reduce  Total
Case (1)
Launched map tasks        0       0      4
Data-local map tasks      0       0      4

Case (2)
Launched map tasks        0       0      2
Data-local map tasks      0       0      1

Hmm, I don't quite understand this. In case (2), does it mean only one
of the two map tasks read its input from a local datanode, while the
other had to fetch its data over the network from another node?

But anyway, is this kind of monitoring needed for tuning performance?
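
For what it's worth, the "Data-local map tasks" counter is usually read as a fraction of the launched maps; a small sketch of that calculation, using the case numbers from above:

```python
def data_local_fraction(launched_maps, data_local_maps):
    """Fraction of map tasks whose input split lived on the node they ran on."""
    return data_local_maps / launched_maps

# Case (1): every launched map read its input from a local datanode
print(data_local_fraction(4, 4))   # 1.0
# Case (2): one of the two maps had to pull its split over the network
print(data_local_fraction(2, 1))   # 0.5
```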

Thank you.
