From: "Devaraj Das (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Date: Tue, 20 Jan 2009 09:55:02 -0800 (PST)
Message-ID: <259688386.1232474102358.JavaMail.jira@brutus>
In-Reply-To: <1471723864.1228347704179.JavaMail.jira@brutus>
Subject: [jira] Commented: (HADOOP-4766) Hadoop performance degrades significantly as more and more jobs complete

    [ https://issues.apache.org/jira/browse/HADOOP-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665484#action_12665484 ]

Devaraj Das commented on HADOOP-4766:
-------------------------------------

The thing that worries me about the existing patch is that it is not at all predictable how many jobs/tasks will be in memory at any point. In my experiments with this patch, and with a standalone program simulating the same behavior the patch is trying to achieve, I saw that even after purging all the jobs, the memory usage as reported by Runtime.totalMemory() - Runtime.freeMemory() didn't come down for quite a while, and the thread kept trying to free up memory needlessly (note that things like whether incremental GC is in use would also influence this behavior).

The approach of keeping at most 'n' completed tasks in memory at least leads to much more predictability. True, we don't know the exact memory consumed by a TIP, but we can make a good estimate and tweak the value of the max tasks in memory if need be. Also, in the current patch, the configuration for the memory usage threshold is equally dependent on estimation; I am not sure what the threshold should be: 0.75, 0.8, or 0.9?

Why do you say it is overkill? I thought basing things on estimating total memory usage is trickier. Basing it on the number of completed tasks seems very similar to the "number of completed jobs" limit that we currently have; it's just that we are stepping one level below and specifying a value for something whose base size will always remain under control.

Also, completed jobs should be treated as one unit w.r.t. removal. For example, if the configured max tasks is 1000 and we have a job with 1100 tasks, the entire job should be removed (as opposed to removing only 1000 of its tasks), keeping the whole thing really simple. Again, this is a short-term fix until we move to the model of having a separate history server process.
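To make the whole-job-as-a-unit idea concrete, here is a minimal, hypothetical sketch (it is not the attached patch, and the names RetiredJobCache, CompletedJob, and maxCompletedTasks are made up for illustration, not Hadoop APIs): it caps the number of completed tasks held in memory and always evicts whole jobs, oldest first, rather than triggering off an estimate from Runtime.totalMemory() - Runtime.freeMemory().

{code}
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch only: retire completed jobs from memory once the total
 * number of completed tasks exceeds a configured cap, always removing a job
 * as one unit. RetiredJobCache, CompletedJob and maxCompletedTasks are
 * illustrative names, not the HADOOP-4766 patch or Hadoop APIs.
 */
class RetiredJobCache {

  /** Stand-in for a finished job whose task data is still held by the JobTracker. */
  static final class CompletedJob {
    final String jobId;
    final int numTasks; // number of TIPs the job still references

    CompletedJob(String jobId, int numTasks) {
      this.jobId = jobId;
      this.numTasks = numTasks;
    }
  }

  private final int maxCompletedTasks; // e.g. 1000, read from configuration
  private final Deque<CompletedJob> completed = new ArrayDeque<CompletedJob>();
  private int tasksInMemory = 0;

  RetiredJobCache(int maxCompletedTasks) {
    this.maxCompletedTasks = maxCompletedTasks;
  }

  /** Called when a job completes; purges oldest jobs first, whole jobs at a time. */
  synchronized void jobCompleted(CompletedJob job) {
    completed.addLast(job);
    tasksInMemory += job.numTasks;
    // Purge whole jobs (never a subset of a job's tasks) until we are back
    // under the cap. A single job larger than the cap is removed entirely.
    // The trigger is the running task count, not totalMemory()-freeMemory(),
    // which can stay high for a while after references are dropped.
    while (tasksInMemory > maxCompletedTasks && !completed.isEmpty()) {
      CompletedJob oldest = completed.removeFirst();
      tasksInMemory -= oldest.numTasks;
      // here the JobTracker would drop its references so the TIPs become collectable
    }
  }
}
{code}

With maxCompletedTasks set to 1000, adding a job with 1100 tasks immediately purges that job in its entirety, which keeps the bookkeeping trivial.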
> Hadoop performance degrades significantly as more and more jobs complete
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-4766
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4766
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.18.2, 0.19.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Blocker
>         Attachments: HADOOP-4766-v1.patch, HADOOP-4766-v2.10.patch, HADOOP-4766-v2.4.patch, HADOOP-4766-v2.6.patch, HADOOP-4766-v2.7-0.18.patch, HADOOP-4766-v2.7-0.19.patch, HADOOP-4766-v2.7.patch, HADOOP-4766-v2.8-0.18.patch, HADOOP-4766-v2.8-0.19.patch, HADOOP-4766-v2.8.patch, map_scheduling_rate.txt
>
>
> When I ran the gridmix 2 benchmark load on a fresh cluster of 500 nodes with Hadoop trunk, the gridmix load, consisting of 202 map/reduce jobs of various sizes, completed in 32 minutes.
> Then I ran the same set of jobs on the same cluster; they completed in 43 minutes.
> When I ran them the third time, it took (almost) forever: the job tracker became non-responsive.
> The job tracker's heap size was set to 2GB.
> The cluster is configured to keep up to 500 jobs in memory.
> The job tracker kept one CPU busy all the time. It looks like this was due to GC.
> I believe releases 0.18 and 0.19 also have similar behavior.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.