From: "Devaraj Das (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Date: Tue, 20 Jan 2009 09:55:02 -0800 (PST)
Message-ID: <259688386.1232474102358.JavaMail.jira@brutus>
In-Reply-To: <1471723864.1228347704179.JavaMail.jira@brutus>
Subject: [jira] Commented: (HADOOP-4766) Hadoop performance degrades significantly as more and more jobs complete

    [ https://issues.apache.org/jira/browse/HADOOP-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665484#action_12665484 ]

Devaraj Das commented on HADOOP-4766:
-------------------------------------

The thing that worries me about the existing patch is that it is not at all predictable how many jobs/tasks will be in memory at any point. In my experiments with this patch, and with a standalone program simulating the same behavior the patch is trying to achieve, I saw that even after purging all the jobs, the memory usage as reported by Runtime.totalMemory() - Runtime.freeMemory() didn't come down for quite a while, and the thread kept trying to free up memory needlessly (note that things like whether incremental GC is in use would also influence this behavior).

The approach of keeping at most 'n' completed tasks in memory at least leads to much more predictability. True, we don't know the exact memory consumed by a TIP, but we can make a good estimate and tweak the value of the max tasks in memory if need be. Also, in the current patch, the configuration for the memory usage threshold is equally dependent on estimation; I am not sure what the threshold should be: 0.75, 0.8, or 0.9?

Why do you say it is overkill? I thought basing things on estimating total memory usage is trickier. Basing it on the number of completed tasks seems very similar to the "number of completed jobs" limit that we currently have; it's just that we are stepping one level below and specifying a value for something whose base size will always remain under control.

Also, completed jobs should be treated as one unit w.r.t. removal. For example, if the configured max tasks is 1000 and we have a job with 1100 tasks, the entire job should be removed (as opposed to removing only 1000 of its tasks), keeping the whole thing really simple. Again, this is a short-term fix until we move to the model of having a separate history server process.
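To make the whole-job-as-a-unit idea concrete, here is a minimal, hypothetical sketch (it is not the attached patch, and the names RetiredJobCache, CompletedJob, and maxCompletedTasks are made up for illustration, not Hadoop APIs): it caps the number of completed tasks held in memory and always evicts whole jobs, oldest first, rather than triggering off an estimate from Runtime.totalMemory() - Runtime.freeMemory().

{code}
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch only: retire completed jobs from memory once the total
 * number of completed tasks exceeds a configured cap, always removing a job
 * as one unit. RetiredJobCache, CompletedJob and maxCompletedTasks are
 * illustrative names, not the HADOOP-4766 patch or Hadoop APIs.
 */
class RetiredJobCache {

  /** Stand-in for a finished job whose task data is still held by the JobTracker. */
  static final class CompletedJob {
    final String jobId;
    final int numTasks; // number of TIPs the job still references

    CompletedJob(String jobId, int numTasks) {
      this.jobId = jobId;
      this.numTasks = numTasks;
    }
  }

  private final int maxCompletedTasks; // e.g. 1000, read from configuration
  private final Deque<CompletedJob> completed = new ArrayDeque<CompletedJob>();
  private int tasksInMemory = 0;

  RetiredJobCache(int maxCompletedTasks) {
    this.maxCompletedTasks = maxCompletedTasks;
  }

  /** Called when a job completes; purges oldest jobs first, whole jobs at a time. */
  synchronized void jobCompleted(CompletedJob job) {
    completed.addLast(job);
    tasksInMemory += job.numTasks;
    // Purge whole jobs (never a subset of a job's tasks) until we are back
    // under the cap. A single job larger than the cap is removed entirely.
    // The trigger is the running task count, not totalMemory()-freeMemory(),
    // which can stay high for a while after references are dropped.
    while (tasksInMemory > maxCompletedTasks && !completed.isEmpty()) {
      CompletedJob oldest = completed.removeFirst();
      tasksInMemory -= oldest.numTasks;
      // here the JobTracker would drop its references so the TIPs become collectable
    }
  }
}
{code}

With maxCompletedTasks set to 1000, adding a job with 1100 tasks immediately purges that job in its entirety, which keeps the bookkeeping trivial.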
> Hadoop performance degrades significantly as more and more jobs complete
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-4766
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4766
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.18.2, 0.19.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Blocker
>         Attachments: HADOOP-4766-v1.patch, HADOOP-4766-v2.10.patch, HADOOP-4766-v2.4.patch, HADOOP-4766-v2.6.patch, HADOOP-4766-v2.7-0.18.patch, HADOOP-4766-v2.7-0.19.patch, HADOOP-4766-v2.7.patch, HADOOP-4766-v2.8-0.18.patch, HADOOP-4766-v2.8-0.19.patch, HADOOP-4766-v2.8.patch, map_scheduling_rate.txt
>
>
> When I ran the gridmix 2 benchmark load on a fresh cluster of 500 nodes with Hadoop trunk, the gridmix load, consisting of 202 map/reduce jobs of various sizes, completed in 32 minutes.
> Then I ran the same set of jobs on the same cluster; they completed in 43 minutes.
> When I ran them the third time, it took (almost) forever: the job tracker became non-responsive.
> The job tracker's heap size was set to 2GB.
> The cluster is configured to keep up to 500 jobs in memory.
> The job tracker kept one CPU busy all the time. It looks like this was due to GC.
> I believe releases 0.18 and 0.19 also have similar behavior.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.