Subject: Re: Hadoop 1.0.4 Performance Problem
From: Jie Li <jieli@cs.duke.edu>
To: user@hadoop.apache.org
Date: Thu, 13 Dec 2012 20:46:25 -0500

Hi Jon,

Thanks for sharing these insights! Can't agree with you more!

Recently we released a tool called Starfish Hadoop Log Analyzer for
analyzing job histories. I believe it can quickly point out the reduce
problem you hit!

http://www.cs.duke.edu/starfish/release.html

Jie

On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen wrote:
> Jie,
>
> Simple answer - I got lucky (though obviously there are things you need
> to have in place to allow you to be lucky).
>
> Before running the upgrade I ran a set of tests to baseline the cluster
> performance, e.g. terasort, gridmix and some operational jobs.
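[A minimal sketch of the kind of teragen/terasort baseline run described
here, assuming the stock examples jar shipped with Hadoop 1.0.4; the jar
name, row count, reducer count and HDFS paths are illustrative only and
depend on the install:]

    # Generate input for the sort (100-byte rows, so 150 billion rows is
    # roughly the 15 TB data set mentioned later in this thread)
    hadoop jar hadoop-examples-1.0.4.jar teragen 150000000000 /benchmarks/tera-in

    # Run the sort; pinning mapred.reduce.tasks to "just over one reducer
    # per worker node" mirrors the configuration described in this thread
    hadoop jar hadoop-examples-1.0.4.jar terasort \
        -Dmapred.reduce.tasks=400 /benchmarks/tera-in /benchmarks/tera-out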
> Terasort by itself isn't very realistic as a cluster test, but it's nice
> and simple to run and is good for regression testing things after a
> change.
>
> After the upgrade the intention was to run the same tests and show that
> the performance hadn't degraded (improved would have been nice, but not
> worse was the minimum). When we ran the terasort we found that
> performance was about 50% worse - execution time had gone from 40 minutes
> to 60 minutes. As I've said, terasort doesn't provide a realistic view of
> operational performance, but this showed that something major had changed
> and we needed to understand it before going further. So how to go about
> diagnosing this ...
>
> First rule - understand what you're trying to achieve. It's very easy to
> say performance isn't good enough, but performance can always be better,
> so you need to know what's realistic and at what point you're going to
> stop tuning things. I had a previous baseline that I was trying to match,
> so I knew what I was trying to achieve.
>
> The next thing to do is profile your job and identify where the problem
> is. We had the full job history from the before and after jobs, and
> comparing these we saw that map performance was fairly consistent, as
> were the reduce sort and reduce phases. The problem was with the shuffle,
> which had gone from 20 minutes pre-upgrade to 40 minutes afterwards. The
> important thing here is to make sure you've got as much information as
> possible. If we'd just kept the overall job time there would have been a
> lot more areas to look at, but knowing the problem was with the shuffle
> allowed me to focus effort in this area.
>
> So what had changed in the shuffle that might have slowed things down?
> The first thing we thought of was that we'd moved from a tarball
> deployment to using the RPM, so what effect might this have had? Our
> operational configuration compresses the map output, and in the past
> we've had problems with Java compression libraries being used rather than
> native ones, which affected performance. We knew the RPM deployment had
> moved the native library, so we spent some time confirming to ourselves
> that these were being used correctly (but this turned out not to be the
> problem). We then spent time doing some process and server profiling -
> using dstat to look at the server bottlenecks and jstack/jmap to check
> what the task tracker and reduce processes were doing. Although not
> directly relevant to this particular problem, doing this was useful just
> to get my head around what Hadoop is doing at various points of the
> process.
>
> The next bit was one place where I got lucky - I happened to be logged
> onto one of the worker nodes when a test job was running and I noticed
> that there weren't any reduce tasks running on the server. This was odd,
> as we'd submitted more reducers than we have servers, so I'd expected at
> least one task to be running on each server. Checking the job tracker log
> file, it turned out that since the upgrade the job tracker had been
> submitting reduce tasks to only 10% of the available nodes - a different
> 10% each time the job was run, so clearly the individual task trackers
> were working OK but there was something odd going on with the task
> allocation. The job tracker log file also showed that before the upgrade
> tasks had been fairly evenly distributed, so something had changed.
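[Two quick checks of the kind described above, sketched for MR1 / Hadoop
1.x; the paths, log file name and the exact JobTracker log wording are
assumptions and vary by version and install:]

    # Per-phase timings (map, shuffle, sort, reduce) from the retained job
    # history, assuming history was kept under the job output directory
    hadoop job -history /benchmarks/tera-out

    # Rough view of how reduce attempts were spread across tasktrackers,
    # assuming the usual "Adding task (REDUCE) ... for tracker 'tracker_...'"
    # lines in the JobTracker log
    grep "Adding task (REDUCE)" hadoop-*-jobtracker-*.log \
        | grep -o "tracker_[^:']*" | sort | uniq -c | sort -rn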
> After that it was a case of digging around the source code to find out
> which classes were available for task allocation and what inside them had
> changed. This can be quite daunting, but if you're comfortable with Java
> then it's just a case of following the calls through the code. Once I
> found the cause it was just a case of working out what my options were
> for working around it (in this case turning off the multiple assignment
> option - I can work out whether I want to turn it back on in slower
> time).
>
> Where I think we got very lucky is that we hit this problem at all. The
> configuration we use for the terasort has just over 1 reducer per worker
> node rather than maxing out the available reducer slots. This decision
> was made several years ago and I can't remember the reasons for it. If
> we'd been using a larger number of reducers then the number of worker
> nodes in use would have been similar regardless of the allocation
> algorithm, and so the performance would have looked similar before and
> after the upgrade. We would have hit this problem eventually, but
> probably not until we started running user jobs, and by then it would
> have been too late to do the intrusive investigations that were possible
> now.
>
> Hope this has been useful.
>
> Regards,
> Jon
>
>
> On Tue, Nov 27, 2012 at 3:08 PM, Jie Li wrote:
>>
>> Jon:
>>
>> This is interesting and helpful! How did you figure out the cause? And
>> how much time did you spend? Could you share some experience of
>> performance diagnosis?
>>
>> Jie
>>
>> On Tuesday, November 27, 2012, Harsh J wrote:
>>>
>>> Hi Amit,
>>>
>>> The default scheduler is FIFO, and may not work for all forms of
>>> workloads. Read about the multiple schedulers available to see if they
>>> have features that may benefit your workload:
>>>
>>> Capacity Scheduler:
>>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>>> FairScheduler:
>>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>>>
>>> While there's a good overlap of features between them, there are a few
>>> differences that set them apart and make them each useful for
>>> different use-cases. If I had to summarize some such differences,
>>> FairScheduler is better suited to SLA forms of job execution due to
>>> its preemptive features (which make it useful in user and service mix
>>> scenarios), while CapacityScheduler provides manual-resource-request
>>> oriented scheduling for odd jobs with high memory workloads, etc.
>>> (which makes it useful for running certain specific kinds of jobs
>>> alongside the regular ones).
>>>
>>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela wrote:
>>> > So this is a FairScheduler problem?
>>> > We are using the default Hadoop scheduler. Is there a reason to use
>>> > the Fair Scheduler if most of the time we don't have more than 4
>>> > jobs running simultaneously?
>>> >
>>> >
>>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J wrote:
>>> >>
>>> >> Hi Amit,
>>> >>
>>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>>> >> property. It is true by default, which works well for most
>>> >> workloads, if not benchmark-style workloads. I would not usually
>>> >> trust that as a base perf. measure of everything that comes out of
>>> >> an upgrade.
>>> >>
>>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
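[A mapred-site.xml sketch of the knob being discussed, for MR1 with the
fair scheduler enabled; the property names come from the 1.x fair
scheduler documentation, so verify them against your own version:]

    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <!-- false: hand out one task per tasktracker heartbeat; true (the
           default) lets the post-MAPREDUCE-2981 scheduler fill all free
           slots in a single heartbeat, which is what skewed the terasort
           reducers described in this thread -->
      <name>mapred.fairscheduler.assignmultiple</name>
      <value>false</value>
    </property>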
>>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela wrote:
>>> >> > Hi Jon,
>>> >> >
>>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to
>>> >> > Hadoop 1.0.4 and I haven't noticed any performance issues. By
>>> >> > "multiple assignment feature" do you mean speculative execution
>>> >> > (mapred.map.tasks.speculative.execution and
>>> >> > mapred.reduce.tasks.speculative.execution)?
>>> >> >
>>> >> >
>>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen wrote:
>>> >> >>
>>> >> >> Problem solved, but worth warning others about.
>>> >> >>
>>> >> >> Before the upgrade the reducers for the terasort process had been
>>> >> >> evenly distributed around the cluster - one per task tracker in
>>> >> >> turn, looping around the cluster until all tasks were allocated.
>>> >> >> After the upgrade all reduce tasks had been submitted to a small
>>> >> >> number of task trackers - tasks were submitted until a task
>>> >> >> tracker's slots were full before moving on to the next task
>>> >> >> tracker. Skewing the reducers like this quite clearly hit the
>>> >> >> benchmark performance.
>>> >> >>
>>> >> >> The reason for this turns out to be the fair scheduler rewrite
>>> >> >> (MAPREDUCE-2981), which appears to have subtly modified the
>>> >> >> behaviour of the assign-multiple property. Previously this
>>> >> >> property caused a single map and a single reduce task to be
>>> >> >> allocated in a task tracker heartbeat (rather than the default of
>>> >> >> a map or a reduce). After the upgrade it allocates as many tasks
>>> >> >> as there are available task slots. Turning off the multiple
>>> >> >> assignment feature returned the terasort to its pre-upgrade
>>> >> >> performance.
>>> >> >>
>>> >> >> I can see potential benefits to this change and need to think
>>> >> >> through the consequences for real-world applications (though in
>>> >> >> practice we're likely to move away from the fair scheduler due to
>>> >> >> MAPREDUCE-4451). Investigating this has been a pain, so to warn
>>> >> >> other users, is there anywhere central that can be used to record
>>> >> >> upgrade gotchas like this?
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen wrote:
>>> >> >>>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4
>>> >> >>> and have hit performance problems. Before the upgrade a 15TB
>>> >> >>> terasort took about 45 minutes; afterwards it takes just over an
>>> >> >>> hour. Looking in more detail it appears the shuffle phase has
>>> >> >>> increased from 20 minutes to 40 minutes. Does anyone have any
>>> >> >>> thoughts about what'
>>> --
>>> Harsh J
>
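[For contrast with the fair-scheduler setting shown earlier: the
speculative-execution switches Amit asks about are separate per-job MR1
properties and are unrelated to how the scheduler packs tasks onto
tasktrackers; a mapred-site.xml sketch with the stock 1.x defaults made
explicit:]

    <property>
      <!-- launch backup attempts for slow map tasks (default: true) -->
      <name>mapred.map.tasks.speculative.execution</name>
      <value>true</value>
    </property>
    <property>
      <!-- launch backup attempts for slow reduce tasks (default: true) -->
      <name>mapred.reduce.tasks.speculative.execution</name>
      <value>true</value>
    </property>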