hadoop-user mailing list archives

From: Jie Li <ji...@cs.duke.edu>
Subject: Re: Hadoop 1.0.4 Performance Problem
Date: Thu, 20 Dec 2012 00:27:43 GMT
Hi Chris,

The standalone log analyzer was released in December and is designed
to be easier to use.

Regarding the license, I think it's OK to use it in a commercial
environment for evaluation purposes, and your feedback would help us
improve it.

Jie

On Tue, Dec 18, 2012 at 1:02 AM, Chris Smith <csmithx@gmail.com> wrote:
> Jie,
>
> "Recently" was over 11 months ago.  :-)
>
> Unfortunately the software licence requires that most of us 'negotiate' a
> commercial use license before we trial the software in a commercial
> environment:
> http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt
> and as clarified here: http://www.cs.duke.edu/starfish/previous.html
>
> Under that last URL was a note that you were soon to distribute the source
> code under the Apache Software License.  Last time I asked the reply was
> that this would not happen.  Perhaps it is time to update your web pages or
> your license arrangements.  :-)
>
> I like what I saw on my home 'cluster' but don't have the time to sort out
> licensing to trial this in a commercial environment.
>
> Chris
>
> On 14 December 2012 01:46, Jie Li <jieli@cs.duke.edu> wrote:
>>
>> Hi Jon,
>>
>> Thanks for sharing these insights! Can't agree with you more!
>>
>> Recently we released a tool called Starfish Hadoop Log Analyzer for
>> analyzing job histories.  I believe it can quickly pinpoint the reduce
>> problem you hit!
>>
>> http://www.cs.duke.edu/starfish/release.html
>>
>> Jie
>>
>> On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <jayayedev@gmail.com> wrote:
>> > Jie,
>> >
>> > Simple answer - I got lucky (though obviously there are things you need
>> > to have in place to allow you to be lucky).
>> >
>> > Before running the upgrade I ran a set of tests to baseline the cluster
>> > performance, e.g. terasort, gridmix and some operational jobs.  Terasort
>> > by itself isn't very realistic as a cluster test but it's nice and
>> > simple to run and is good for regression testing things after a change.
>> >
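>> > For anyone wanting to repeat that kind of baseline, the stock examples
>> > jar is enough; roughly something like this (jar path and HDFS paths are
>> > illustrative, and teragen rows are 100 bytes each, so 150 billion rows
>> > is about 15 TB):
>> >
>> >   # generate the input data
>> >   hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen \
>> >       150000000000 /benchmarks/terasort-input
>> >
>> >   # run and time the sort itself
>> >   hadoop jar $HADOOP_HOME/hadoop-examples-*.jar terasort \
>> >       /benchmarks/terasort-input /benchmarks/terasort-output
>> >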
>> > After the upgrade the intention was to run the same tests and show that
>> > the performance hadn't degraded (improved would have been nice but not
>> > worse was the minimum).  When we ran the terasort we found that
>> > performance was about 50% worse - execution time had gone from 40
>> > minutes to 60 minutes.  As I've said, terasort doesn't provide a
>> > realistic view of operational performance but this showed that something
>> > major had changed and we needed to understand it before going further.
>> > So how to go about diagnosing this ...
>> >
>> > First rule - understand what you're trying to achieve.  It's very easy
>> > to say performance isn't good enough, but performance can always be
>> > better, so you need to know what's realistic and at what point you're
>> > going to stop tuning things.  I had a previous baseline that I was
>> > trying to match, so I knew what I was trying to achieve.
>> >
>> > Next thing to do is profile your job and identify where the problem is.
>> > We had the full job history from the before and after jobs and,
>> > comparing these, we saw that map performance was fairly consistent, as
>> > were the reduce sort and reduce phases.  The problem was with the
>> > shuffle, which had gone from 20 minutes pre-upgrade to 40 minutes
>> > afterwards.  The important thing here is to make sure you've got as much
>> > information as possible.  If we'd just kept the overall job time then
>> > there would have been a lot more areas to look at, but knowing the
>> > problem was with shuffle allowed me to focus effort in this area.
>> >
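>> > If you still have the job output around, the history viewer built into
>> > the CLI will print those per-phase timings for you; something along
>> > these lines, with the output path obviously just an example:
>> >
>> >   # summarises task counts and average/worst shuffle, sort and reduce
>> >   # times for the job that wrote this output directory
>> >   hadoop job -history /benchmarks/terasort-output
>> >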
>> > So what had changed in the shuffle that may have slowed things down?
>> > The first thing we thought of was that we'd moved from a tarball
>> > deployment to using the RPM, so what effect might this have had on
>> > things?  Our operational configuration compresses the map output and in
>> > the past we've had problems with Java compression libraries being used
>> > rather than native ones, and this has affected performance.  We knew the
>> > RPM deployment had moved the native library so we spent some time
>> > confirming to ourselves that these were being used correctly (but this
>> > turned out not to be the problem).  We then spent time doing some
>> > process and server profiling - using dstat to look at the server
>> > bottlenecks and jstack/jmap to check what the task tracker and reduce
>> > processes were doing.  Although not directly relevant to this particular
>> > problem, doing this was useful just to get my head around what Hadoop is
>> > doing at various points of the process.
>> >
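>> > None of that profiling needed anything exotic - just the standard tools
>> > pointed at the right processes, something like the following (the pids
>> > are placeholders, looked up via jps):
>> >
>> >   jps -l                        # find the TaskTracker and task child pids
>> >   dstat -cdnm 5                 # cpu, disk, network, memory every 5 seconds
>> >   jstack <tasktracker-pid>      # thread dump - what is it blocked on?
>> >   jmap -histo <reduce-task-pid> | head -25   # rough heap histogram
>> >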
>> > The next bit was one place where I got lucky - I happened to be logged
>> > onto one of the worker nodes when a test job was running and I noticed
>> > that there weren't any reduce tasks running on the server.  This was odd
>> > as we'd submitted more reducers than we have servers, so I'd expected at
>> > least one task to be running on each server.  Checking the job tracker
>> > log file, it turned out that since the upgrade the job tracker had been
>> > submitting reduce tasks to only 10% of the available nodes.  A different
>> > 10% each time the job was run, so clearly the individual task trackers
>> > were working OK but there was something odd going on with the task
>> > allocation.  Checking the job tracker log file showed that before the
>> > upgrade tasks had been fairly evenly distributed, so something had
>> > changed.  After that it was a case of digging around the source code to
>> > find out which classes were available for task allocation and what
>> > inside them had changed.  This can be quite daunting but if you're
>> > comfortable with Java then it's just a case of following the calls
>> > through the code.  Once I found the cause it was just a case of working
>> > out what my options were for working around it (in this case turning off
>> > the multiple assignment option - I can work out whether I want to turn
>> > it back on in slower time).
>> >
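>> > For reference, the workaround itself is a single property on the
>> > JobTracker; a minimal sketch of the mapred-site.xml change (using the
>> > fair scheduler property named further down this thread), followed by a
>> > JobTracker restart:
>> >
>> >   <!-- With this set to false the scheduler hands out at most one task
>> >        per TaskTracker heartbeat instead of filling every free slot -->
>> >   <property>
>> >     <name>mapred.fairscheduler.assignmultiple</name>
>> >     <value>false</value>
>> >   </property>
>> >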
>> > Where I think we got very lucky is that we hit this problem at all.  The
>> > configuration we use for the terasort has just over 1 reducer per worker
>> > node rather than maxing out the available reducer slots.  This decision
>> > was made several years ago and I can't remember the reasons for it.  If
>> > we'd been using a larger number of reducers then the number of worker
>> > nodes in use would have been similar regardless of the allocation
>> > algorithm and so the performance would have looked similar before and
>> > after the upgrade.  We would have hit this problem eventually, but
>> > probably not until we started running user jobs, and by then it would
>> > have been too late to do the intrusive investigations that were possible
>> > now.
>> >
>> > Hope this has been useful.
>> >
>> > Regards,
>> > Jon
>> >
>> >
>> >
>> > On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <jieli@cs.duke.edu> wrote:
>> >>
>> >> Jon:
>> >>
>> >> This is interesting and helpful! How did you figure out the cause? And
>> >> how much time did you spend? Could you share some of your experience
>> >> with performance diagnosis?
>> >>
>> >> Jie
>> >>
>> >> On Tuesday, November 27, 2012, Harsh J wrote:
>> >>>
>> >>> Hi Amit,
>> >>>
>> >>> The default scheduler is FIFO, and may not work well for all forms of
>> >>> workloads. Read about the other schedulers available to see if they
>> >>> have features that may benefit your workload:
>> >>>
>> >>> Capacity Scheduler:
>> >>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>> >>> FairScheduler:
>> >>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>> >>>
>> >>> While there's a good overlap of features between them, there are a few
>> >>> differences that set them apart and make each useful for different
>> >>> use-cases. If I had to summarize some of those differences: the
>> >>> FairScheduler is better suited to SLA-style job execution due to its
>> >>> preemptive features (which make it useful in mixed user and service
>> >>> scenarios), while the CapacityScheduler provides manual-resource-request
>> >>> oriented scheduling for odd jobs with high memory workloads, etc.
>> >>> (which makes it useful for running certain specific kinds of jobs
>> >>> alongside the regular ones).
>> >>>
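>> >>> Either one is switched on by pointing the JobTracker at the scheduler
>> >>> class in mapred-site.xml and restarting it (the scheduler jar from
>> >>> contrib/ also needs to be on the JobTracker's classpath); a minimal
>> >>> sketch:
>> >>>
>> >>>   <property>
>> >>>     <name>mapred.jobtracker.taskScheduler</name>
>> >>>     <!-- or org.apache.hadoop.mapred.CapacityTaskScheduler -->
>> >>>     <value>org.apache.hadoop.mapred.FairScheduler</value>
>> >>>   </property>
>> >>>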
>> >>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <amits@infolinks.com>
>> >>> wrote:
>> >>> > So this is a FairScheduler problem?
>> >>> > We are using the default Hadoop scheduler. Is there a reason to use
>> >>> > the Fair Scheduler if most of the time we don't have more than 4 jobs
>> >>> > running simultaneously?
>> >>> >
>> >>> >
>> >>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <harsh@cloudera.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Hi Amit,
>> >>> >>
>> >>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>> >>> >> property. It is true by default, which works well for most workloads
>> >>> >> if not benchmark-style workloads. I would not usually trust that as a
>> >>> >> base perf. measure of everything that comes out of an upgrade.
>> >>> >>
>> >>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>> >>> >>
>> >>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <amits@infolinks.com>
>> >>> >> wrote:
>> >>> >> > Hi Jon,
>> >>> >> >
>> >>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to
>> >>> >> > Hadoop 1.0.4 and I haven't noticed any performance issues. By
>> >>> >> > "multiple assignment feature" do you mean speculative execution
>> >>> >> > (mapred.map.tasks.speculative.execution and
>> >>> >> > mapred.reduce.tasks.speculative.execution)?
>> >>> >> >
>> >>> >> >
>> >>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <jayayedev@gmail.com>
>> >>> >> > wrote:
>> >>> >> >>
>> >>> >> >> Problem solved, but worth warning others about.
>> >>> >> >>
>> >>> >> >> Before the upgrade the reducers for the terasort process had been
>> >>> >> >> evenly distributed around the cluster - one per task tracker in
>> >>> >> >> turn, looping around the cluster until all tasks were allocated.
>> >>> >> >> After the upgrade all reduce tasks had been submitted to a small
>> >>> >> >> number of task trackers - submit tasks until the task tracker
>> >>> >> >> slots were full and then move onto the next task tracker.  Skewing
>> >>> >> >> the reducers like this quite clearly hit the benchmark
>> >>> >> >> performance.
>> >>> >> >>
>> >>> >> >> The reason for this turns out to be the fair scheduler rewrite
>> >>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the
>> >>> >> >> behaviour of the assign multiple property.  Previously this
>> >>> >> >> property caused a single map and a single reduce task to be
>> >>> >> >> allocated in a task tracker heartbeat (rather than the default of
>> >>> >> >> a map or a reduce).  After the upgrade it allocates as many tasks
>> >>> >> >> as there are available task slots.  Turning off the multiple
>> >>> >> >> assignment feature returned the terasort to its pre-upgrade
>> >>> >> >> performance.
>> >>> >> >>
>> >>> >> >> I can see potential benefits to this change and need to think
>> >>> >> >> through the consequences for real world applications (though in
>> >>> >> >> practice we're likely to move away from the fair scheduler due to
>> >>> >> >> MAPREDUCE-4451).  Investigating this has been a pain, so to warn
>> >>> >> >> other users: is there anywhere central that can be used to record
>> >>> >> >> upgrade gotchas like this?
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen
>> >>> >> >> <jayayedev@gmail.com>
>> >>> >> >> wrote:
>> >>> >> >>>
>> >>> >> >>> Hi,
>> >>> >> >>>
>> >>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4
>> >>> >> >>> and have hit performance problems.  Before the upgrade a 15TB
>> >>> >> >>> terasort took about 45 minutes, afterwards it takes just over an
>> >>> >> >>> hour.  Looking in more detail it appears the shuffle phase has
>> >>> >> >>> increased from 20 minutes to 40 minutes.  Does anyone have any
>> >>> >> >>> thoughts about what'
>> >>>
>> >>> --
>> >>> Harsh J
>> >>>
>> >
>
>
