hadoop-common-user mailing list archives

From Vasilis Liaskovitis <vlias...@gmail.com>
Subject Re: utilizing all cores on single-node hadoop
Date Sun, 23 Aug 2009 23:56:49 GMT
Hi,

thanks to everyone for the valuable suggestions.

What would be the default number of map and reduce tasks for the
sort-rand example described at http://wiki.apache.org/hadoop/Sort ?
This is one of the simplest possible examples and uses identity
mappers/reducers.

I am seeing 160 map tasks and 27 reduce tasks in my jobtracker web UI
for a single-node test. The number of map tasks seems particularly
odd, because my mapred.tasktracker.reduce.tasks.maximum=30 and
mapred.map.tasks=24 settings were both well below 160.
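
If I'm reading the example right, the numbers may simply fall out of
the input size and the reduce-slot capacity rather than out of my
settings. A back-of-the-envelope sketch (assuming randomwriter wrote
its default 10 maps x 1 GB = ~10 GB on this node, the block size is
the default 64 MB, and the Sort driver sizes reduces at roughly 90% of
the cluster's reduce slots - I haven't verified this against the source):

  ~10 GB of input / 64 MB per split              = 160 map tasks
  0.9 x (1 tasktracker x 30 reduce slots)         =  27 reduce tasks

If that is what's happening, the 160 maps would come from the input
size and not from mapred.map.tasks at all.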

In general, is the number of map/reduce tasks for a specific job set
by the job-specific Mapper/Reducer Java classes, or is it inferred
somehow by the framework?
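
For reference, this is my current understanding of the split between
the job and the framework, sketched with the old mapred API (the class
name and values are just placeholders, and I may well be wrong):

  import org.apache.hadoop.mapred.JobConf;

  public class TaskCountSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf(TaskCountSketch.class);
      conf.setNumMapTasks(24);     // only a hint; the InputFormat's splits decide the real count
      conf.setNumReduceTasks(27);  // this one is honored by the framework
      System.out.println("maps hint = " + conf.getNumMapTasks()
          + ", reduces = " + conf.getNumReduceTasks());
    }
  }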

Also, cores may be idle because the job is I/O-bound - what are the
config parameters related to memory/disk buffering of map outputs and
reduce merges? With the default io.sort.mb and io.sort.factor, would
you expect the sort example to be I/O-bound? Some profiling runs
should help investigate this soon, but at this point I am just asking
for any intuition from more experienced users.
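
In case it helps to be concrete, these are the knobs I'm planning to
experiment with (my reading of the 0.20 defaults: io.sort.mb=100,
io.sort.factor=10, io.sort.spill.percent=0.80); the values below are
placeholders, not recommendations:

  import org.apache.hadoop.mapred.JobConf;

  public class SpillTuningSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf();
      // map-side sort buffer and spill/merge behaviour
      conf.setInt("io.sort.mb", 200);
      conf.setFloat("io.sort.spill.percent", 0.80f);
      conf.setInt("io.sort.factor", 20);
      // reduce-side shuffle buffering
      conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
      conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.0f);
      System.out.println("io.sort.mb = " + conf.getInt("io.sort.mb", 100));
    }
  }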

I have switched to using hadoop-0.20.0. (I believe this version has
moved the site-specific overrides from conf/hadoop-site.xml to
conf/mapred-site.xml and several other conf/ files; let me know if the
site overrides don't work or should be changed somewhere else for this
version.)
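
A quick way I can think of to sanity-check whether the conf/ overrides
are actually picked up is to print the values the framework sees
(untested sketch; run via bin/hadoop so the conf/ dir is on the classpath):

  import org.apache.hadoop.mapred.JobConf;

  public class PrintEffectiveConf {
    public static void main(String[] args) {
      // JobConf pulls in core-site.xml and mapred-site.xml from the classpath
      JobConf conf = new JobConf();
      System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
      System.out.println("mapred.tasktracker.map.tasks.maximum = "
          + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
      System.out.println("mapred.tasktracker.reduce.tasks.maximum = "
          + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
    }
  }
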
Does 0.20.0 have a different job scheduler or different default
settings than 0.19.2? I am getting higher core utilizations with
0.20.0 for some jobs, e.g. the wordcount examples.

thanks,

- Vasilis

On Wed, Aug 19, 2009 at 9:09 AM, Jason Venner <jason.hadoop@gmail.com> wrote:
> Another reason you may not see full utilization of your map tasks per
> tracker is if the mean run time of a task is very short. All the slots are
> being used, but the setup and teardown time for each task is large
> compared to the run time of the task, so it appears that not all the task
> slots are being used.
>
>
> On Mon, Aug 17, 2009 at 10:35 PM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:
>
>> While setting mapred.tasktracker.map.tasks.maximum and
>> mapred.tasktracker.reduce.tasks.maximum, please consider the memory usage
>> your application might have, since all tasks will be competing for the same
>> memory and this might reduce overall performance.
>>
>> Thanks,
>> Amogh
>> -----Original Message-----
>> From: Harish Mallipeddi [mailto:harish.mallipeddi@gmail.com]
>> Sent: Tuesday, August 18, 2009 10:37 AM
>> To: common-user@hadoop.apache.org
>> Subject: Re: utilizing all cores on single-node hadoop
>>
>> Hi Vasilis,
>>
>> Here's some info that I know:
>>
>> mapred.map.tasks - this is a job-specific setting. This is just a hint to
>> InputFormat as to how many InputSplits (and hence MapTasks) you want for
>> your job. The default InputFormat classes usually keep each split size at
>> the HDFS block size (64MB by default). So if your input data is less than
>> 64 MB, it will result in just 1 split and hence only 1 MapTask.
>>
>> mapred.reduce.tasks - this is also a job-specific setting.
>>
>> mapred.tasktracker.map.tasks.maximum
>> mapred.tasktracker.reduce.tasks.maximum
>>
>> The above 2 are tasktracker-specific config options and determine how many
>> "simultaneous" MapTasks and ReduceTasks run on each TT. Ideally on a 8-core
>> box, you would want to set map.tasks.maximum to something like 6 and
>> reduce.tasks.maximum to 4 to utilize all the 8 cores to the maximum
>> (there's
>> a little bit of over-subscription to account for tasks idling while doing
>> I/O).
>>
>> In the web admin console, how many map-tasks and reduce-tasks are reported
>> to have been launched for your job?
>>
>> Cheers,
>> Harish
>>
>> On Tue, Aug 18, 2009 at 5:47 AM, Vasilis Liaskovitis <vliaskov@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I am a beginner trying to set up a few simple hadoop tests on a single
>> > node before moving on to a cluster. I am just using the simple
>> > wordcount example for now. My question is: what's the best way to
>> > guarantee utilization of all cores on a single node? So assuming a
>> > single node with 16 cores, what are the suggested values for:
>> >
>> > mapred.map.tasks
>> > mapred.reduce.tasks
>> >
>> > mapred.tasktracker.map.tasks.maximum
>> > mapred.tasktracker.reduce.tasks.maximum
>> >
>> > I found an old similar thread
>> > http://www.mail-archive.com/hadoop-user@lucene.apache.org/msg00152.html
>> > and I have followed similar settings for my 16-core system (e.g.
>> > map.tasks=reduce.tasks=90 and map.tasks.maximum=100); however, I always
>> > see only 3-4 cores utilized in top.
>> >
>> > - The description for mapred.map.tasks says "Ignored when
>> > mapred.job.tracker is 'local'", and in my case
>> > mapred.job.tracker=hdfs://localhost:54311.
>> > Is it possible that the map.tasks and reduce.tasks I am setting are
>> > being ignored? How can I verify this? Is there a way to enforce my
>> > values even in a localhost scenario like this?
>> >
>> > - Are there other config options/values that I need to set besides the
>> > 4 I mentioned above?
>> >
>> > - Also is it possible that for short tasks, I won't see full
>> > utilization of all cores anyway? Something along those lines is
>> > mentioned in an issue a year ago:
>> > http://issues.apache.org/jira/browse/HADOOP-3136
>> > "If the individual tasks are very short i.e. run for less than the
>> > heartbeat interval the TaskTracker serially runs one task at a time"
>> >
>> > I am using hadoop-0.19.2
>> >
>> > thanks for any guidance,
>> >
>> > - Vasilis
>> >
>>
>>
>>
>> --
>> Harish Mallipeddi
>> http://blog.poundbang.in
>>
>
>
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.amazon.com/dp/1430219424?tag=jewlerymall
> www.prohadoopbook.com a community for Hadoop Professionals
>
