hadoop-mapreduce-user mailing list archives

From Elaine Gan <elaine-...@gmo.jp>
Subject Understanding of the hadoop distribution system (tuning)
Date Tue, 11 Sep 2012 01:56:10 GMT

I'm new to Hadoop and have just started playing around with MapReduce.
I would like to check whether my understanding of Hadoop is correct, and
I would appreciate it if anyone could correct me where I'm wrong.

I have around 518MB of data, and I wrote an MR program to process it.
Here are some of the settings in my mapred-site.xml:
mapred.tasktracker.map.tasks.maximum = 20
mapred.tasktracker.reduce.tasks.maximum = 20
My block size is the default, 64MB.
With a data size of 518MB, I guess that setting the maximum number of MR
tasks to 20 is far more than enough (518/64 = 8.1, so about 8 map
tasks). Did I get that right?
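
As a rough check of that arithmetic, here is a minimal Python sketch of
how I understand the split count to be derived. It assumes the input is
a single splittable file and that the split size equals the block size.
The 1.1 "slop" factor is, as far as I know, what FileInputFormat uses to
let the last split run slightly over; count_splits is just a name I made
up for illustration.

```python
SPLIT_SLOP = 1.1  # assumed tolerance that lets the last split exceed split_size by up to 10%

def count_splits(file_size, split_size):
    """Count input splits for one splittable file (illustrative helper)."""
    remaining = file_size
    splits = 0
    # Carve off full-size splits while clearly more than one split's worth remains.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left (up to 1.1 * split_size) becomes the final split.
    if remaining > 0:
        splits += 1
    return splits

MB = 1024 * 1024
print(count_splits(518 * MB, 64 * MB))  # 7 full splits plus one 70MB tail split
```

Under those assumptions the 70MB remainder stays in one split, which
would match the Maps Total = 8 that I see.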

When I run the MR program, the Map/Reduce Administration page shows
Maps Total = 8, so I assume everything is going well here. Once again,
please correct me if I'm wrong.
(Sometimes it shows only Maps Total = 3.)

There's one thing I'm uncertain about regarding how Hadoop distributes
work. Does Maps Total = 8 mean that there are 8 map tasks split among
all the data nodes (task trackers)?
Is there any way I can check whether the tasks are shared among all the
data nodes (where the task trackers are running)?
When I click on each link under Task Id, I can see "Input Split
Locations" listed in the task details. If the inputs are split between
data nodes, does that mean that everything is working?
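
One rough way I could check this is to note down, by hand, which host
ran each task and then tally the tasks per host. The sketch below
assumes exactly that; the task labels and host names are hypothetical
placeholders, not values from my cluster, and tasks_per_host is a name I
made up.

```python
from collections import Counter

def tasks_per_host(assignments):
    """Tally how many tasks ran on each host from (task, host) pairs."""
    return Counter(host for _task, host in assignments)

# Hypothetical (task, host) pairs copied by hand from the task detail pages.
assignments = [
    ("map_0", "node-a"),
    ("map_1", "node-b"),
    ("map_2", "node-a"),
    ("map_3", "node-c"),
]

counts = tasks_per_host(assignments)
for host, n in sorted(counts.items()):
    print(host, n)
```

If a single host accounted for nearly all of the tasks, I would take
that as a sign that the work is not being spread out.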

I need to make sure everything is running well, because my MR job took
around 6 hours to finish even though the input is small. (Well, I know
Hadoop is not meant for small data.) I'm not sure whether my
configuration is wrong or Hadoop is just not suitable for my case.
I'm actually running a Mahout k-means analysis.

Thank you for your time.
