Subject: Re: Hadoop 1.0.4 Performance Problem
From: Jie Li <jieli@cs.duke.edu>
To: user@hadoop.apache.org
Date: Thu, 13 Dec 2012 20:46:25 -0500

Hi Jon,

Thanks for sharing these insights! Can't agree with you more!

Recently we released a tool called Starfish Hadoop Log Analyzer for
analyzing job histories. I believe it can quickly point out the reduce
problem you hit!

http://www.cs.duke.edu/starfish/release.html

Jie

On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen wrote:
> Jie,
>
> Simple answer - I got lucky (though obviously there are things you need
> to have in place to allow you to be lucky).
>
> Before running the upgrade I ran a set of tests to baseline the cluster
> performance, e.g. terasort, gridmix and some operational jobs.
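[A minimal sketch of the kind of teragen/terasort baseline run described
here, assuming the stock examples jar shipped with Hadoop 1.0.4; the jar
name, row count, reducer count and HDFS paths are illustrative only and
depend on the install:]

    # Generate input for the sort (100-byte rows, so 150 billion rows is
    # roughly the 15 TB data set mentioned later in this thread)
    hadoop jar hadoop-examples-1.0.4.jar teragen 150000000000 /benchmarks/tera-in

    # Run the sort; pinning mapred.reduce.tasks to "just over one reducer
    # per worker node" mirrors the configuration described in this thread
    hadoop jar hadoop-examples-1.0.4.jar terasort \
        -Dmapred.reduce.tasks=400 /benchmarks/tera-in /benchmarks/tera-out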
> Terasort by itself isn't very realistic as a cluster test, but it's nice
> and simple to run and is good for regression testing things after a
> change.
>
> After the upgrade the intention was to run the same tests and show that
> the performance hadn't degraded (improved would have been nice, but not
> worse was the minimum). When we ran the terasort we found that
> performance was about 50% worse - execution time had gone from 40 minutes
> to 60 minutes. As I've said, terasort doesn't provide a realistic view of
> operational performance, but this showed that something major had changed
> and we needed to understand it before going further. So how to go about
> diagnosing this ...
>
> First rule - understand what you're trying to achieve. It's very easy to
> say performance isn't good enough, but performance can always be better,
> so you need to know what's realistic and at what point you're going to
> stop tuning things. I had a previous baseline that I was trying to match,
> so I knew what I was trying to achieve.
>
> The next thing to do is profile your job and identify where the problem
> is. We had the full job history from the before and after jobs, and
> comparing these we saw that map performance was fairly consistent, as
> were the reduce sort and reduce phases. The problem was with the shuffle,
> which had gone from 20 minutes pre-upgrade to 40 minutes afterwards. The
> important thing here is to make sure you've got as much information as
> possible. If we'd just kept the overall job time there would have been a
> lot more areas to look at, but knowing the problem was with the shuffle
> allowed me to focus effort in this area.
>
> So what had changed in the shuffle that might have slowed things down?
> The first thing we thought of was that we'd moved from a tarball
> deployment to using the RPM, so what effect might this have had? Our
> operational configuration compresses the map output, and in the past
> we've had problems with Java compression libraries being used rather than
> native ones, which affected performance. We knew the RPM deployment had
> moved the native library, so we spent some time confirming to ourselves
> that these were being used correctly (but this turned out not to be the
> problem). We then spent time doing some process and server profiling -
> using dstat to look at the server bottlenecks and jstack/jmap to check
> what the task tracker and reduce processes were doing. Although not
> directly relevant to this particular problem, doing this was useful just
> to get my head around what Hadoop is doing at various points of the
> process.
>
> The next bit was one place where I got lucky - I happened to be logged
> onto one of the worker nodes when a test job was running and I noticed
> that there weren't any reduce tasks running on the server. This was odd,
> as we'd submitted more reducers than we have servers, so I'd expected at
> least one task to be running on each server. Checking the job tracker log
> file, it turned out that since the upgrade the job tracker had been
> submitting reduce tasks to only 10% of the available nodes - a different
> 10% each time the job was run, so clearly the individual task trackers
> were working OK but there was something odd going on with the task
> allocation. The job tracker log file also showed that before the upgrade
> tasks had been fairly evenly distributed, so something had changed.
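[Two quick checks of the kind described above, sketched for MR1 / Hadoop
1.x; the paths, log file name and the exact JobTracker log wording are
assumptions and vary by version and install:]

    # Per-phase timings (map, shuffle, sort, reduce) from the retained job
    # history, assuming history was kept under the job output directory
    hadoop job -history /benchmarks/tera-out

    # Rough view of how reduce attempts were spread across tasktrackers,
    # assuming the usual "Adding task (REDUCE) ... for tracker 'tracker_...'"
    # lines in the JobTracker log
    grep "Adding task (REDUCE)" hadoop-*-jobtracker-*.log \
        | grep -o "tracker_[^:']*" | sort | uniq -c | sort -rn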
> After that it was a case of digging around the source code to find out
> which classes were available for task allocation and what inside them had
> changed. This can be quite daunting, but if you're comfortable with Java
> then it's just a case of following the calls through the code. Once I
> found the cause it was just a case of working out what my options were
> for working around it (in this case turning off the multiple assignment
> option - I can work out whether I want to turn it back on in slower
> time).
>
> Where I think we got very lucky is that we hit this problem at all. The
> configuration we use for the terasort has just over 1 reducer per worker
> node rather than maxing out the available reducer slots. This decision
> was made several years ago and I can't remember the reasons for it. If
> we'd been using a larger number of reducers then the number of worker
> nodes in use would have been similar regardless of the allocation
> algorithm, and so the performance would have looked similar before and
> after the upgrade. We would have hit this problem eventually, but
> probably not until we started running user jobs, and by then it would
> have been too late to do the intrusive investigations that were possible
> now.
>
> Hope this has been useful.
>
> Regards,
> Jon
>
>
> On Tue, Nov 27, 2012 at 3:08 PM, Jie Li wrote:
>>
>> Jon:
>>
>> This is interesting and helpful! How did you figure out the cause? And
>> how much time did you spend? Could you share some experience of
>> performance diagnosis?
>>
>> Jie
>>
>> On Tuesday, November 27, 2012, Harsh J wrote:
>>>
>>> Hi Amit,
>>>
>>> The default scheduler is FIFO, and may not work for all forms of
>>> workloads. Read about the multiple schedulers available to see if they
>>> have features that may benefit your workload:
>>>
>>> Capacity Scheduler:
>>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>>> FairScheduler:
>>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>>>
>>> While there's a good overlap of features between them, there are a few
>>> differences that set them apart and make them each useful for
>>> different use-cases. If I had to summarize some such differences,
>>> FairScheduler is better suited to SLA forms of job execution due to
>>> its preemptive features (which make it useful in user and service mix
>>> scenarios), while CapacityScheduler provides manual-resource-request
>>> oriented scheduling for odd jobs with high memory workloads, etc.
>>> (which makes it useful for running certain specific kinds of jobs
>>> alongside the regular ones).
>>>
>>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela wrote:
>>> > So this is a FairScheduler problem?
>>> > We are using the default Hadoop scheduler. Is there a reason to use
>>> > the Fair Scheduler if most of the time we don't have more than 4
>>> > jobs running simultaneously?
>>> >
>>> >
>>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J wrote:
>>> >>
>>> >> Hi Amit,
>>> >>
>>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>>> >> property. It is true by default, which works well for most
>>> >> workloads, if not benchmark-style workloads. I would not usually
>>> >> trust that as a base perf. measure of everything that comes out of
>>> >> an upgrade.
>>> >>
>>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
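[A mapred-site.xml sketch of the knob being discussed, for MR1 with the
fair scheduler enabled; the property names come from the 1.x fair
scheduler documentation, so verify them against your own version:]

    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <!-- false: hand out one task per tasktracker heartbeat; true (the
           default) lets the post-MAPREDUCE-2981 scheduler fill all free
           slots in a single heartbeat, which is what skewed the terasort
           reducers described in this thread -->
      <name>mapred.fairscheduler.assignmultiple</name>
      <value>false</value>
    </property>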
>>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela wrote:
>>> >> > Hi Jon,
>>> >> >
>>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to
>>> >> > Hadoop 1.0.4 and I haven't noticed any performance issues. By
>>> >> > "multiple assignment feature" do you mean speculative execution
>>> >> > (mapred.map.tasks.speculative.execution and
>>> >> > mapred.reduce.tasks.speculative.execution)?
>>> >> >
>>> >> >
>>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen wrote:
>>> >> >>
>>> >> >> Problem solved, but worth warning others about.
>>> >> >>
>>> >> >> Before the upgrade the reducers for the terasort process had been
>>> >> >> evenly distributed around the cluster - one per task tracker in
>>> >> >> turn, looping around the cluster until all tasks were allocated.
>>> >> >> After the upgrade all reduce tasks had been submitted to a small
>>> >> >> number of task trackers - tasks were submitted until a task
>>> >> >> tracker's slots were full before moving on to the next task
>>> >> >> tracker. Skewing the reducers like this quite clearly hit the
>>> >> >> benchmark performance.
>>> >> >>
>>> >> >> The reason for this turns out to be the fair scheduler rewrite
>>> >> >> (MAPREDUCE-2981), which appears to have subtly modified the
>>> >> >> behaviour of the assign-multiple property. Previously this
>>> >> >> property caused a single map and a single reduce task to be
>>> >> >> allocated in a task tracker heartbeat (rather than the default of
>>> >> >> a map or a reduce). After the upgrade it allocates as many tasks
>>> >> >> as there are available task slots. Turning off the multiple
>>> >> >> assignment feature returned the terasort to its pre-upgrade
>>> >> >> performance.
>>> >> >>
>>> >> >> I can see potential benefits to this change and need to think
>>> >> >> through the consequences for real-world applications (though in
>>> >> >> practice we're likely to move away from the fair scheduler due to
>>> >> >> MAPREDUCE-4451). Investigating this has been a pain, so to warn
>>> >> >> other users, is there anywhere central that can be used to record
>>> >> >> upgrade gotchas like this?
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen wrote:
>>> >> >>>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4
>>> >> >>> and have hit performance problems. Before the upgrade a 15TB
>>> >> >>> terasort took about 45 minutes; afterwards it takes just over an
>>> >> >>> hour. Looking in more detail it appears the shuffle phase has
>>> >> >>> increased from 20 minutes to 40 minutes. Does anyone have any
>>> >> >>> thoughts about what'
>>> --
>>> Harsh J
>
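[For contrast with the fair-scheduler setting shown earlier: the
speculative-execution switches Amit asks about are separate per-job MR1
properties and are unrelated to how the scheduler packs tasks onto
tasktrackers; a mapred-site.xml sketch with the stock 1.x defaults made
explicit:]

    <property>
      <!-- launch backup attempts for slow map tasks (default: true) -->
      <name>mapred.map.tasks.speculative.execution</name>
      <value>true</value>
    </property>
    <property>
      <!-- launch backup attempts for slow reduce tasks (default: true) -->
      <name>mapred.reduce.tasks.speculative.execution</name>
      <value>true</value>
    </property>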