hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: hadoop idle time on terasort
Date Wed, 09 Dec 2009 19:15:25 GMT
As always, Scott provided lots of great advice below. One note to be aware
of:

The fair scheduler "assignmultiple" feature in 0.20 doesn't do quite what
you might think. It gives the ability to assign one map and one reduce
per TT heartbeat, but doesn't assign multiple map tasks in a single
heartbeat. So the rate at which map tasks can be assigned is still
reasonably slow. This is why you should strive to have each map task take
at least a minute (ballpark, in my opinion).
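
(As a rough back-of-the-envelope, assuming a heartbeat interval of about 3
seconds - the actual interval varies with cluster size: a tasktracker with
12 map slots that receives at most one map per heartbeat needs on the order
of 12 * 3 = 36 seconds just to fill its slots, so maps much shorter than
that leave slots idle.)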

MAPREDUCE-706 (in 0.21) changes the assignmultiple feature to allow multiple
task assignments of the same type in a single heartbeat. This helps cluster
utilization when you have short tasks/short jobs.

Thanks
-Todd

On Wed, Dec 9, 2009 at 11:00 AM, Scott Carey <scott@richrelevance.com> wrote:

>
>
>
> On 12/8/09 1:24 PM, "Vasilis Liaskovitis" <vliaskov@gmail.com> wrote:
>
> > Hi Scott,
> >
> > thanks for the extra tips, these are very helpful.
> >
> >>
> >
> > I think the slots are being highly utilized, but I seem to have
> > forgotten which option in the web UI allows you to look at the slot
> > allocations during runtime on each tasktracker. Is this available on
> > the job details from the jobtracker's UI or somewhere else?
> >
>
> I usually just look at the jobtracker's main page, which lists the total
> number of slots for the cluster and how many are in use:
> "Maps, Reduces, Map Task Capacity, Reduce Task Capacity".
> I usually don't drill down to a per-tasktracker basis.
>
> I also look at the running jobs on the same main page, and see the
> allocation of tasks between them.  Clicking down to an individual job lets
> you see the tasks of each job and view their logs.  That is usually
> sufficient (combined with top, mpstat, iostat, etc) to see how the
> scheduling is working and correlate tasks or phases of hadoop with certain
> system behavior.
>
> >> Running the fair scheduler -- or any scheduler that can be configured to
> >> schedule more than one task per heartbeat -- can dramatically increase
> >> your slot utilization if it is low.
> >
> > I've been running a single job - would the scheduler benefits show
> > up with multiple jobs only, by definition? I am now trying multiple
> > simultaneous sort jobs with smaller disjoint datasets launched in
> > parallel by the same user. Do I need to set up any fairscheduler
> > parameters other than the below?
>
> It can help with only one job as well, due to the 'assignmultiple' feature.
> Often, the default scheduler will assign one task per tasktracker ping.
> This is not enough to keep up, or to ramp up quickly at the start of the
> job.
> This is something you can see in the UI.  If tasks are completing faster
> than the scheduler can assign new ones, the slot utilization will be low.
>
> >
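> > <!-- use the fair scheduler in place of the default FIFO scheduler -->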
> > <property>
> >   <name>mapred.jobtracker.taskScheduler</name>
> >   <value>org.apache.hadoop.mapred.FairScheduler</value>
> > </property>
> >
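> > <!-- assign both a map and a reduce in a single heartbeat -->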
> > <property>
> >     <name>mapred.fairscheduler.assignmultiple</name>
> >     <value>true</value>
> > </property>
> >
> > <property>
> >   <name>mapred.fairscheduler.allocation.file</name>
> >   <value>/home/user2/hadoop-0.20.1/conf/fair-scheduler.xml</value>
> > </property>
> >
> > fair-scheduler.xml is:
> >
> > <?xml version="1.0"?>
> > <allocations>
> >  <pool name="user2">
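> >    <!-- minimum slots guaranteed to this pool; weight scales its fair share -->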
> >    <minMaps>12</minMaps>
> >    <minReduces>12</minReduces>
> >    <maxRunningJobs>16</maxRunningJobs>
> >    <weight>4</weight>
> >  </pool>
> >  <userMaxJobsDefault>16</userMaxJobsDefault>
> > </allocations>
> >
> > With 4 parallel sort jobs, I am noticing that maps execute in parallel
> > across all jobs.  But reduce tasks are only allocated/executed from a
> > single job at a time, until that job finishes. Is that expected or am
> > I missing something in my fairscheduler (or other) settings?
> >
>
> The config seems fine.
>
> The reduce tasks all belonging to one job seems a bit odd to me; usually
> reduce tasks are distributed amongst jobs just as map tasks are.  However,
> there is another parameter that defines how far along the map phase of a
> job has to be before its reduces start being scheduled.  That, combined
> with how many total reduce slots you have on your cluster and how many
> each job is trying to create, might explain the behavior you are seeing.
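> If I remember right, that parameter is mapred.reduce.slowstart.completed.maps
> (the fraction of a job's maps that must complete before its reduces are
> scheduled; the default is quite low, 0.05 I believe).  As an illustration
> only, raising it would look like:
>
> <property>
>   <name>mapred.reduce.slowstart.completed.maps</name>
>   <value>0.50</value>
> </property>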
>
> I don't think I set the minMaps or minReduces and don't know the defaults.
> Maybe that is affecting it?
>
>
> >> See this ticket: http://issues.apache.org/jira/browse/MAPREDUCE-318 and
> >> my comment from June 10, 2009.
> >>
> >
> > For a single big sort job, I have ~2300 maps and 84 reduces on a 7-node
> > cluster with 12-core nodes. The thread dumps for my unpatched version
> > also show sleeping threads at fetchOutputs() - I don't know how often
> > you've seen that in your own task dumps.  From what I understand, what
> > we are looking for in the reduce logs, in terms of a shuffle idle
> > bottleneck, is the time elapsed between the "shuffling" lines and the
> > "in-memory merge complete" lines - does that sound right? With the
> > one-line change in fetchOutputs(), I've sometimes seen the average
> > shuffle time across tasks go down by ~5-10%, and so has the execution
> > time. But the results are variable across runs; I need to look at the
> > reduce logs and repeat the experiment.
>
> For something like sort, which does not shrink the data in a combiner or
> other filter before the reduce, I would not expect the same massive gains
> that I saw in that ticket.
> I built a close to worst-case example, where each reducer had to fetch 20+
> map fragments from each node and the fetch time for each chunk was a few
> milliseconds. In addition, the remainder of the reduce work there was
> trivial; in your case there is a lot of merging and spilling to do.  This
> will make the logs harder to interpret -- there's more going on.
>
> In your 7-node case, with 2300 maps, the default behavior would cause a
> delay of a couple of seconds * (2300/7). Is that roughly the 10% you see?
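> (Back-of-the-envelope: 2300 maps / 7 nodes is about 330 fetches per node,
> and at roughly 2 seconds each that is on the order of 10 minutes of
> accumulated shuffle wait - the 2-second figure is just the "couple of
> seconds" above.)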
>
> >
> > From your system description in the ticket-318 notes, I see you have
> > configured your cluster to do "10 concurrent shuffle copies".  Does
> > this refer to the parameter mapred.reduce.parallel.copies (default 5)
> > or io.sort.factor (default 10)? I retried with reduce.parallel.copies
> > doubled from the default 5 to 10, but it didn't help.
> >
>
> The former.  I also have modified sort factor and sort memory (to 64 and
> 256, I think) but that is tuned to my workload and job types.
>
> Tuning parallel copies is somewhat workload dependent -- for something like
> sort that emits as much data as it ingests, there shouldn't be a large
> effect.
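> In mapred-site.xml terms, those knobs look like the below (the values are
> the ones I mentioned, tuned to my workload rather than general
> recommendations, and "sort memory" above refers to io.sort.mb):
>
> <property>
>   <name>mapred.reduce.parallel.copies</name>
>   <value>10</value>
> </property>
> <property>
>   <name>io.sort.factor</name>
>   <value>64</value>
> </property>
> <property>
>   <name>io.sort.mb</name>
>   <value>256</value>
> </property>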
>
> >> My summary is that the hadoop scheduling process has so far not been
> >> tuned for servers that can run more than 6 or so tasks at once.  A
> >> server capable of running 12 maps is especially prone to running
> >> under-utilized.  Many changes in the 0.21 timeframe address some of
> >> this.
> >>
> >
> > What levels of utilization have you achieved on servers capable of
> > running 10-12 map/reduce slots? I understand that this depends on the
> > type and number of jobs. I suspect you've seen higher utilization
> > when having more concurrent jobs. Were you using the fairscheduler
> > instead of the default one?
>
> With the default scheduler, and a cluster configured with 13 maps and 11
> reduces per machine, I could not get a sequence of ~30 M/R jobs to utilize
> more than 35% of a node (where 100% means either having all slots full OR
> being bottlenecked by I/O or CPU).
>
> Using the fair scheduler increased the utilization to 85%+ in terms of
> slot occupation: 99%+ when jobs that work on larger data sets were active,
> and less when dependent jobs working on smaller files (outputs of previous
> large jobs) were completing more rapidly than the scheduler could keep up.
>
> But after the fair scheduler tuning to increase slot usage, the systems
> were frequently not I/O or CPU bound.  Much of the time was spent sleeping
> in the shuffle.  After that fix, CPU usage in those phases increased
> dramatically, and now the job sequence is often CPU bound, with occasional
> I/O bound periods.  However, the faster reduces that resulted from these
> changes do lead to times when the 0.19.x fair scheduler can't keep up with
> the rate of tasks completing, since it can only assign one map and one
> reduce per tasktracker ping.  During ramp-up of jobs with more tasks, the
> utilization is still around 70% and could be better.
> Later versions fix this, and can assign many tasks of both types, so I
> expect that a later upgrade to 0.21 will yield higher utilization still.
>
>
> >
> > If you can suggest any public hadoop examples/benchmarks that allow
> > for the "cascade-type" MR jobs that you refer to, please share.
> >
>
> Most of the public examples don't do anything near as complicated as a
> real-world problem with a dependency chain of jobs and intermediate data.
> The PigMix benchmarks do a few individual tasks that might represent
> components of a larger data flow, and are useful as a reference for more
> complicated problems than the generic sort, word count, etc.  But they are
> still fairly simple.
> http://wiki.apache.org/pig/PigMix
>
>
>
> > thanks again,
> >
> > - Vasilis
> >
>
>
