hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: hadoop idle time on terasort
Date Wed, 09 Dec 2009 19:10:54 GMT
On Wed, Dec 9, 2009 at 2:00 PM, Scott Carey <scott@richrelevance.com> wrote:
> On 12/8/09 1:24 PM, "Vasilis Liaskovitis" <vliaskov@gmail.com> wrote:
>> Hi Scott,
>> thanks for the extra tips, these are very helpful.
>> I think the slots are being highly utilized, but I seem to have
>> forgotten which option in the web UI allows you to look at the slot
>> allocations during runtime on each tasktracker. Is this available on
>> the job details from the jobtracker's ui or somewhere else?
> I usually just look at the jobtracker's main page, that lists the total
> number of slots for the cluster, and how many are in use.
> "Maps, Reduces, Map Task Capacity, Reduce Task Capacity"
> I usually don't drill down to a per-tasktracker basis.
> I also look at the running jobs on the same main page, and see the
> allocation of tasks between them.  Clicking down to an individual job lets
> you see the tasks of each job and view their logs.  That is usually
> sufficient (combined with top, mpstat, iostat, etc) to see how the
> scheduling is working and correlate tasks or phases of hadoop with certain
> system behavior.
>>> Running the fair scheduler -- or any scheduler -- that can be configured to
>>> schedule more than one task per heartbeat can dramatically increase your
>>> slot utilization if it is low.
>> I've been running a single job - would the scheduler benefits show
>> up with multiple jobs only, by definition? I am now trying multiple
>> simultaneous sort jobs with smaller disjoint datasets launched in
>> parallel by the same user. Do I need to set up any fairscheduler
>> parameters other than the below?
> It can help with only one job as well, due to the 'assignmultiple' feature.
> Often, the default scheduler will assign one task per tasktracker ping.
> This is not enough to keep up, or to ramp up at the start of the job
> quickly.
> This is something you can see in the UI.  If jobs are completing faster than
> the scheduler can assign new tasks the slot utilization will be low.
>> <property>
>>   <name>mapred.jobtracker.taskScheduler</name>
>>   <value>org.apache.hadoop.mapred.FairScheduler</value>
>> </property>
>> <property>
>>   <name>mapred.fairscheduler.assignmultiple</name>
>>   <value>true</value>
>> </property>
>> <property>
>>   <name>mapred.fairscheduler.allocation.file</name>
>>   <value>/home/user2/hadoop-0.20.1/conf/fair-scheduler.xml</value>
>> </property>
>> fair-scheduler.xml is:
>> <?xml version="1.0"?>
>> <allocations>
>>  <pool name="user2">
>>    <minMaps>12</minMaps>
>>    <minReduces>12</minReduces>
>>    <maxRunningJobs>16</maxRunningJobs>
>>    <weight>4</weight>
>>  </pool>
>>  <userMaxJobsDefault>16</userMaxJobsDefault>
>> </allocations>
>> With 4 parallel sort jobs, I am noticing that maps execute in parallel
>> across all jobs.  But reduce tasks are only allocated/executed from a
>> single job at a time, until that job finishes. Is that expected or am
>> I missing something in my fairscheduler (or other) settings?
> The config seems fine.
> The reduce tasks all belonging to one job seems a bit odd to me, usually
> reduce tasks are distributed amongst jobs as well as map tasks.   However,
> there is another parameter that defines how far along the map phase of a job
> has to be before it begins scheduling reduces.  That, combined with how many
> total reduce slots you have on your cluster and how many each job is trying
> to create, might create the behavior you are seeing.
> I don't think I set the minMaps or minReduces and don't know the defaults.
> Maybe that is affecting it?
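> If memory serves, the knob that controls that is
> mapred.reduce.slowstart.completed.maps -- the fraction of a job's maps
> that must complete before its reduces get scheduled (the default is
> 0.05, I believe).  As a sketch, raising it in mapred-site.xml would
> look like this (the 0.50 is purely illustrative):
> <property>
>   <name>mapred.reduce.slowstart.completed.maps</name>
>   <value>0.50</value>
> </property>
> With only 84 reduce slots total, whichever job ramps its maps first
> could grab most of the reduce slots early and hold them until it
> finishes, which would match what you describe.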
>>> See this ticket: http://issues.apache.org/jira/browse/MAPREDUCE-318 and my
>>> comment from June 10 2009.
>> For a single big sort job, I have ~2300 maps and 84 reduces on a 7-node
>> cluster with 12-core nodes. The thread dumps for my unpatched version
>> also show sleeping threads at fetchOutputs() - I don't know how often
>> you've seen that in your own task dumps.  From what I understand, what
>> we are looking for in the reduce logs, in terms of a shuffle idle
>> bottleneck, is the time elapsed between the "shuffling" lines and the
>> "in-memory merge complete" lines - does that sound right? With the one
>> line change in fetchOutputs(), I've sometimes seen the average shuffle
>> time across tasks go down by ~5-10%, and so has the execution time. But
>> the results are variable across runs, so I need to look at the reduce
>> logs and repeat the experiment.
> For something like sort, which does not reduce the size of the data prior to
> a reduce in a combiner or other filter, I would not expect the same massive
> gains that I did in that ticket.
> I built a close to worst-case example, where each reducer had to fetch 20+
> map fragments from each node and the fetch time for each chunk was a few
> milliseconds. In addition, the remainder of the reduce work in my case was
> trivial; in your case there is a lot of merging and spilling to do.  This
> will make the logs harder to interpret -- there's more going on.
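> If you want to pull those timings out in bulk rather than reading logs
> by hand, something like this against the tasktracker userlogs should get
> you close -- exact paths and log wording vary a bit by version, so treat
> it as a sketch:
> grep -i -E "shuffling|in-memory merge complete" \
>     logs/userlogs/attempt_*_r_*/syslog
> and then compare the timestamps per reduce attempt.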
> In your 7 node case, with 2300 maps, the default behavior would cause a
> delay of a couple seconds * (2300/7). Is that roughly the 10% you see?
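> (Back of the envelope: 2300 maps / 7 nodes is roughly 330 task
> assignments per node; at one assignment per ~2 second heartbeat that is
> on the order of 660 seconds of accumulated scheduling lag, so whether
> that lands at ~10% depends on your total job time.)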
>> From your system description in the ticket-318 notes, I see you have
>> configured your cluster to do "10 concurrent shuffle copies".  Does
>> this refer to the parameter mapred.reduce.parallel.copies (default 5)
>> or io.sort.factor (default 10)? I retried with
>> mapred.reduce.parallel.copies doubled from the default 5 to 10, but it
>> didn't help.
> The former.  I also have modified sort factor and sort memory (to 64 and
> 256, I think) but that is tuned to my workload and job types.
> Tuning parallel copies is somewhat workload dependent -- For something like
> sort that emits as much data as it ingests, there shouldn't be a large
> effect.
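> In mapred-site.xml terms that tuning looks roughly like the following --
> the values are tuned to my workload, so treat them as a sketch rather
> than a recommendation:
> <property>
>   <name>mapred.reduce.parallel.copies</name>
>   <value>10</value>
> </property>
> <property>
>   <name>io.sort.factor</name>
>   <value>64</value>
> </property>
> <property>
>   <name>io.sort.mb</name>
>   <value>256</value>
> </property>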
>>> My summary is that the hadoop scheduling process has so far not been
>>> tuned for servers that can run more than 6 or so tasks at once.  A server
>>> capable of running 12 maps is especially prone to running under-utilized.
>>> Many changes in the 0.21 timeframe address some of this.
>> What levels of utilization have you achieved on servers capable of
>> running 10-12 maps/reduce slots? I understand that this depends on the
>> type and number of jobs. I suspect you've seen higher utilizations
>> when having more concurrent jobs. Were you using the fairscheduler
>> instead of the default one?
> With the default scheduler, and a cluster configured with 13 maps and 11
> reduces per machine, I could not get a sequence of ~30 M/R jobs to utilize
> more than 35% of a node (where 100% is measured by either having all slots
> full OR being bottlenecked by I/O or CPU).
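> (For reference, that per-machine slot count comes from the tasktracker
> settings in mapred-site.xml, e.g.:
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>13</value>
> </property>
> <property>
>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>   <value>11</value>
> </property>
> )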
> Using the fair scheduler increased the utilization to 85%+ in terms of slot
> occupation, with 99%+ when jobs that work on larger data sets were active,
> and less when dependent jobs working on smaller files (outputs of previous
> large jobs) were completing more rapidly than the scheduler could keep up.
> But after the fair scheduler tuning to increase slot usage, the systems were
> frequently not I/O or CPU bound.  The time was often spent waiting around,
> sleeping in the shuffle.  After that fix, CPU usage in those phases increased
> dramatically, and now the job sequence is often CPU bound, with occasional
> I/O bound periods.  The increase in the speed of the reduce due to these
> changes, however, does lead to some times where the 0.19.x fair scheduler
> can't keep up with the rate of tasks completing, since it can only assign
> one map and one reduce per tasktracker ping.  During ramp-up of jobs with
> more tasks, the utilization is still around 70% and could be better.
> Later versions fix this, and can assign many tasks of both types, so I
> expect that a later upgrade to 0.21 will have higher utilization still.
>> If you can suggest any public hadoop examples/benchmarks that allow
>> for the "cascade-type" MR jobs that you refer to, please share.
> Most of the public examples don't do anything near as complicated as a real
> world problem with a dependency chain of jobs and intermediate data.  The
> stuff in the PigMix benchmarks does a few individual tasks that might
> represent components of a larger data flow, and are useful as reference for
> more complicated problems than the generic sort, word count, etc.  But they
> are still fairly simple.
> http://wiki.apache.org/pig/PigMix
>> thanks again,
>> - Vasilis

I am not sure if this is something Owen O'Malley talked about at Hadoop
World or offline with me, but I remember a few things about the TeraSort.

One thing was that job performance was noticeably affected by transmitting
the jar files associated with the job. A few posts on the list have
suggested patching the TaskTracker startup arguments to include a
local/NFS path where your jar files live. Usually this helps people that
want more real-time results, but it could be of help here; see the sketch
below.
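A minimal sketch of that approach, assuming the job jars sit on a shared
NFS mount (the path and jar name here are made up for illustration),
added to conf/hadoop-env.sh on each tasktracker and picked up after a
restart:

# conf/hadoop-env.sh -- prepend locally available job jars to the daemon
# classpath so they are not shipped with every job submission
export HADOOP_CLASSPATH=/mnt/nfs/job-jars/myjob.jar:$HADOOP_CLASSPATH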

Another optimization was using a very large block size, something well
over 128 MB; a sketch of how that is typically applied follows.
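The block size is set when the input is generated, so for terasort the
usual trick is to pass it to teragen -- a sketch, with the row count and
the 512 MB block size purely illustrative:

hadoop jar hadoop-0.20.1-examples.jar teragen \
    -Ddfs.block.size=536870912 10000000000 /terasort-input

(teragen takes the number of 100-byte rows and an output directory; the
generated files then carry the large block size into the sort itself.)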
