hadoop-common-user mailing list archives

From: Vasilis Liaskovitis <vlias...@gmail.com>
Subject: Re: hadoop idle time on terasort
Date: Tue, 08 Dec 2009 21:24:24 GMT
Hi Scott,

Thanks for the extra tips; they are very helpful.

On Mon, Dec 7, 2009 at 3:57 PM, Scott Carey <scott@richrelevance.com> wrote:
>
>>
>> I am using hadoop-0.20.1 to run terasort and randsort benchmarking
>> tests on a small 8-node Linux cluster. Most runs consist of map and
>> reduce phases with low (<50%) core utilization, as well as heavy I/O
>> phases. There is usually a large fraction of runtime during which
>> cores are idle and disk I/O traffic is not heavy.
>>
>> On average, over the duration of a terasort run, I see 20-30% CPU
>> utilization, 10-30% iowait, and the remaining 40-70% is idle time.
>> This data was collected with mpstat across the cores of a specific
>> node for the duration of the run. The same utilization pattern holds
>> for all tasktracker/datanode machines (the namenode cores and I/O
>> are mostly idle, so there doesn't seem to be a bottleneck at the
>> namenode).
>>
>
> Look at individual task logs for map and reduce through the UI.  Also, look
> at the cluster utilization during a run -- are most map and reduce slots
> full most of the time, or is the slot utilization low?
>

I think the slots are being highly utilized, but I seem to have
forgotten which option in the web UI shows the slot allocations at
runtime on each tasktracker. Is this available from the job details
page in the jobtracker's UI, or somewhere else?

> Running the fair scheduler -- or any scheduler that can be configured
> to schedule more than one task per heartbeat -- can dramatically
> increase your slot utilization if it is low.

I've been running a single job - would the scheduler's benefits show
up only with multiple jobs, by definition? I am now trying multiple
simultaneous sort jobs over smaller, disjoint datasets, launched in
parallel by the same user. Do I need to set up any fairscheduler
parameters other than the ones below?

<!-- use the fair scheduler instead of the default FIFO scheduler -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<!-- let the scheduler assign both a map and a reduce task per heartbeat -->
<property>
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>

<!-- pool definitions live in this file -->
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/home/user2/hadoop-0.20.1/conf/fair-scheduler.xml</value>
</property>

fair-scheduler.xml is:

<?xml version="1.0"?>
<allocations>
  <pool name="user2">
    <!-- guaranteed minimum share of map/reduce slots for this pool -->
    <minMaps>12</minMaps>
    <minReduces>12</minReduces>
    <maxRunningJobs>16</maxRunningJobs>
    <!-- this pool's share relative to other pools -->
    <weight>4</weight>
  </pool>
  <userMaxJobsDefault>16</userMaxJobsDefault>
</allocations>

With 4 parallel sort jobs, I am noticing that maps execute in parallel
across all jobs, but reduce tasks are only allocated and executed from
a single job at a time, until that job finishes. Is that expected, or
am I missing something in my fairscheduler (or other) settings?
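
One thing I may still try is giving each job its own pool, so that the
fair shares are balanced across jobs explicitly. A sketch, assuming
mapred.fairscheduler.poolnameproperty behaves as documented for 0.20
(it names a jobconf key whose value selects the pool; "pool.name" is an
arbitrary key that each job would then set, e.g. with -Dpool.name=sort1):

<!-- jobconf key whose value picks the pool; defaults to user.name -->
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>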

> Next, if you find that your delay times correspond with the shuffle
> phase (look in the reduce logs), there are fixes on the way in 0.21,
> but there is also a quick win: a one-line change that cuts shuffle
> times down a lot on clusters with a large ratio of map tasks per
> node, provided the map output is not too large. For a pure sort test,
> the map outputs are medium sized (the same size as the input), so
> this might not help. But the indicators of the issue are in the
> reduce task logs.
> See this ticket: http://issues.apache.org/jira/browse/MAPREDUCE-318
> and my comment from June 10 2009.
>

For a single big sort job, I have ~2300 maps and 84 reduces on a
7-node cluster with 12-core nodes. The thread dumps for my unpatched
version also show sleeping threads at fetchOutputs() - I don't know
how often you've seen that in your own task dumps. From what I
understand, what we are looking for in the reduce logs, in terms of a
shuffle idle bottleneck, is the time elapsed between the "shuffling"
lines and the "in-memory merge complete" lines - does that sound
right? With the one-line change in fetchOutputs(), I've sometimes seen
the average shuffle time across tasks go down by ~5-10%, and the
execution time with it. But the results vary across runs; I need to
look at the reduce logs and repeat the experiment.
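
In case it helps, here is a rough sketch of how I plan to pull those
gaps out of a reduce task syslog. It assumes 0.20-style log lines that
start with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp and contain the
"Shuffling" / "In-memory merge complete" phrases; I haven't verified
the exact wording, so treat it as a starting point:

# shuffle_gaps.py - estimate shuffle waves in a reduce task log by
# pairing each first "Shuffling" line with the next "In-memory merge
# complete" line.
# Usage: python shuffle_gaps.py < userlogs/<attempt_id>/syslog
import re
import sys
from datetime import datetime

TS = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})')

def timestamp(line):
    m = TS.match(line)
    return datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S,%f') if m else None

start = None
for line in sys.stdin:
    ts = timestamp(line)
    if ts is None:
        continue
    if 'Shuffling' in line and start is None:
        start = ts                # first fetch of this wave
    elif 'In-memory merge complete' in line and start is not None:
        print('shuffle wave: %.1fs' % (ts - start).total_seconds())
        start = None              # look for the next wave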

From your system description in the MAPREDUCE-318 notes, I see you
have configured your cluster to do "10 concurrent shuffle copies".
Does this refer to the parameter mapred.reduce.parallel.copies
(default 5) or io.sort.factor (default 10)? I retried after doubling
mapred.reduce.parallel.copies from the default 5 to 10, but it didn't
help.
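
For reference, my current understanding of the two knobs, in case I am
conflating them (the values shown are just my doubled setting and the
default, respectively):

<!-- parallel fetch threads per reduce task during the shuffle -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>
</property>

<!-- fan-in: number of streams merged at once when sorting/merging spills -->
<property>
  <name>io.sort.factor</name>
  <value>10</value>
</property>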

> My summary is that the hadoop scheduling process has so far not been
> well suited for servers that can run more than 6 or so tasks at once.
> A server capable of running 12 maps is especially prone to running
> under-utilized. Many changes in the 0.21 timeframe address some of
> this.
>

What levels of utilization have you achieved on servers capable of
running 10-12 map/reduce slots? I understand that this depends on the
type and number of jobs. I suspect you've seen higher utilization with
more concurrent jobs. Were you using the fairscheduler instead of the
default one?

If you can suggest any public hadoop examples/benchmarks that exercise
the "cascade-type" MR jobs that you refer to, please share.

thanks again,

- Vasilis
