hadoop-common-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: hadoop idle time on terasort
Date Wed, 09 Dec 2009 19:00:05 GMT



On 12/8/09 1:24 PM, "Vasilis Liaskovitis" <vliaskov@gmail.com> wrote:

> Hi Scott,
> 
> thanks for the extra tips, these are very helpful.
> 
>> 
> 
> I think the slots are being highly utilized, but I seem to have
> forgotten which option in the web UI allows you to look at the slot
> allocations during runtime on each tasktracker. Is this available on
> the job details from the jobtracker's ui or somewhere else?
> 

I usually just look at the jobtracker's main page, that lists the total
number of slots for the cluster, and how many are in use.
"Maps, Reduces, Map Task Capacity, Reduce Task Capacity"
I usually don't drill down to a per-tasktracker basis.

I also look at the running jobs on the same main page, and see the
allocation of tasks between them.  Clicking down to an individual job lets
you see the tasks of each job and view their logs.  That is usually
sufficient (combined with top, mpstat, iostat, etc) to see how the
scheduling is working and correlate tasks or phases of hadoop with certain
system behavior.

>> Running the fair scheduler -- or any scheduler -- that can be configured to
>> schedule more than one task per heartbeat can dramatically increase your
>> slot utilization if it is low.
> 
> I've been running a single job - would the scheduler benefits show
> up with multiple jobs only, by definition? I am now trying multiple
> simultaneous sort jobs with smaller disjoint datasets launched in
> parallel by the same user. Do I need to set up any fairscheduler
> parameters other than the below?

It can help with only one job as well, due to the 'assignmultiple' feature.
Often, the default scheduler will assign only one task per tasktracker ping.
This is not enough to keep up, or to ramp up quickly at the start of a job.
This is something you can see in the UI.  If tasks are completing faster than
the scheduler can assign new ones, the slot utilization will be low.

> 
> <property>
>   <name>mapred.jobtracker.taskScheduler</name>
>   <value>org.apache.hadoop.mapred.FairScheduler</value>
> </property>
> 
> <property>
>     <name>mapred.fairscheduler.assignmultiple</name>
>     <value>true</value>
> </property>
> 
> <property>
>   <name>mapred.fairscheduler.allocation.file</name>
>   <value>/home/user2/hadoop-0.20.1/conf/fair-scheduler.xml</value>
> </property>
> 
> fairscheduler.xml is:
> 
> <?xml version="1.0"?>
> <allocations>
>  <pool name="user2">
>    <minMaps>12</minMaps>
>    <minReduces>12</minReduces>
>    <maxRunningJobs>16</maxRunningJobs>
>    <weight>4</weight>
>  </pool>
>  <userMaxJobsDefault>16</userMaxJobsDefault>
> </allocations>
> 
> With 4 parallel sort jobs, I am noticing that maps execute in parallel
> across all jobs.  But reduce tasks are only allocated/executed from a
> single job at a time, until that job finishes. Is that expected or am
> I missing something in my fairscheduler (or other) settings?
> 

The config seems fine.

The reduce tasks all belonging to one job seems a bit odd to me; usually
reduce tasks are distributed amongst jobs just as map tasks are.  However,
there is another parameter that defines how far along the map phase of a job
has to be before it begins scheduling reduces.  That, combined with how many
total reduce slots you have on your cluster and how many each job is trying
to create, might produce the behavior you are seeing.
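If it helps, the parameter I have in mind is mapred.reduce.slowstart.completed.maps, which sets the fraction of a job's maps that must finish before its reduces are scheduled.  A sketch for mapred-site.xml (I believe 0.05 is the 0.20 default, but check your version):

```xml
<!-- Fraction of a job's map tasks that must complete before its
     reduce tasks are scheduled.  0.05 (5%) is, I believe, the
     default in 0.20; raising it delays reduces, lowering it starts
     them sooner. -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.05</value>
</property>
```

With several jobs and a low slowstart value, the first job's reduces can grab all the reduce slots before the other jobs' reduces become schedulable.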

I don't think I set the minMaps or minReduces and don't know the defaults.
Maybe that is affecting it?


>> See this ticket: http://issues.apache.org/jira/browse/MAPREDUCE-318 and my
>> comment from June 10 2009.
>> 
> 
> For a single big sort job, I have ~2300maps and 84 reduces on a 7node
> cluster with 12-core nodes. The thread dumps for my unpatched version
> also show sleeping threads at fetchOutputs() - I don't know how often
> you've seen it in your own task dumps.  From what I understand, what
> we are looking for in the reduce logs, in terms of a shuffle idle
> bottleneck, is the time elapsed between the "shuffling" lines and the
> "in-memory merge complete" lines - does that sound right? With the
> one-line change in fetchOutputs(), I've sometimes seen the average
> shuffle time across tasks go down by ~5-10%, and so has the execution
> time. But the results are variable across runs; I need to look at the
> reduce logs and repeat the experiment.

For something like sort, which does not shrink the data with a combiner or
other filter before the reduce, I would not expect the same massive gains
that I saw in that ticket.
I built a close to worst-case example, where each reducer had to fetch 20+
map fragments from each node, with a fetch time of a few milliseconds per
chunk.  In addition, the remainder of the reduce work there was trivial; in
your case there is a lot of merging and spilling to do.  This will make the
logs harder to interpret -- there's more going on.

In your 7 node case, with 2300 maps, the default behavior would cause a
delay of a couple seconds * (2300/7). Is that roughly the 10% you see?
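Back of the envelope (taking "a couple seconds" as a round 2 s per fetch round, which is an assumption, not a measured number):

```python
# Rough estimate of accumulated shuffle idle time from the default
# fetchOutputs() sleep behavior.  The 2 s per round is a hypothetical
# round figure; the real interval depends on the Hadoop version.
maps = 2300           # total map tasks in the job
nodes = 7             # tasktracker nodes
sleep_per_round_s = 2

# Each reducer has on the order of maps/nodes fetch rounds per node,
# so the worst-case accumulated sleep is roughly:
idle_s = sleep_per_round_s * maps / nodes
print(round(idle_s))  # ~657 seconds, i.e. about 11 minutes
```

Whether that is ~10% depends on your total job runtime, of course.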

> 
> From your system description in the ticket-318 notes, I see you have
> configured your cluster to do "10 concurrent shuffle copies".  Does
> this refer to the parameter mapred.reduce.parallel.copies (default 5)
> or io.sort.factor (default 10)? I retried with double the
> reduce.parallel.copies, from the default 5 to 10, but it didn't help.
> 

The former.  I also have modified sort factor and sort memory (to 64 and
256, I think) but that is tuned to my workload and job types.

Tuning parallel copies is somewhat workload dependent -- For something like
sort that emits as much data as it ingests, there shouldn't be a large
effect.
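For reference, those knobs all live in mapred-site.xml; a sketch with the values mentioned above (tune to your own workload, these are not recommendations):

```xml
<!-- Shuffle/sort tuning, mirroring the values discussed above. -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>   <!-- concurrent map-output fetch threads per reduce -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>64</value>   <!-- number of streams merged at once -->
</property>
<property>
  <name>io.sort.mb</name>
  <value>256</value>  <!-- in-memory sort buffer size, in MB -->
</property>
```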

>> My summary is that the hadoop scheduling process has not so far been tuned
>> for servers that can run more than 6 or so tasks at once.  A server capable
>> of running 12 maps is especially prone to running under-utilized.  Many
>> changes in the 0.21 timeframe address some of this.
>> 
> 
> What levels of utilization have you achieved on servers capable of
> running 10-12 map/reduce slots? I understand that this depends on the
> type and number of jobs. I suspect you've seen higher utilization
> when running more concurrent jobs. Were you using the fairscheduler
> instead of the default one?

With the default scheduler, and a cluster configured with 13 maps and 11
reduces per machine, I could not get a sequence of ~30 M/R jobs to utilize
more than 35% of a node (where 100% means either having all slots full OR
being bottlenecked by I/O or CPU).

Using the fair scheduler increased the utilization to 85%+ in terms of slot
occupation: 99%+ when jobs working on larger data sets were active, and
less when dependent jobs working on smaller files (outputs of previous
large jobs) complete more rapidly than the scheduler can keep up with.

But after the fair scheduler tuning increased slot usage, the systems were
still frequently not I/O or CPU bound.  Much of the time was spent sleeping
in the shuffle.  After that fix, CPU usage in those phases increased
dramatically, and now the job sequence is often CPU bound, with occasional
I/O bound periods.  The faster reduces do, however, lead to times when the
0.19.x fair scheduler can't keep up with the rate of tasks completing, since
it can only assign one map and one reduce per tasktracker ping.  During
ramp-up of jobs with more tasks, the utilization is still around 70% and
could be better.
Later versions fix this, and can assign many tasks of both types, so I
expect that a later upgrade to 0.21 will yield higher utilization still.


> 
> If you can suggest any public hadoop examples/benchmarks that allow
> for the "cascade-type" MR jobs that you refer to, please share.
> 

Most of the public examples don't do anything near as complicated as a real
world problem with a dependency chain of jobs and intermediate data.  The
jobs in the PigMix benchmark do a few individual tasks that might represent
components of a larger data flow, and are useful as references for more
complicated problems than the generic sort, word count, etc.  But they are
still fairly simple.
http://wiki.apache.org/pig/PigMix



> thanks again,
> 
> - Vasilis
> 

