hadoop-mapreduce-issues mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6622) Add capability to set JHS job cache to a task-based limit
Date Sat, 27 Feb 2016 09:52:18 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170516#comment-15170516 ]

zhihai xu commented on MAPREDUCE-6622:

This patch also fixed a memory leak caused by a race condition in {{CachedHistoryStorage.getFullJob}}.
We can reproduce the leak by rapidly refreshing the JHS web page for a job with more than
40,000 mappers. The race is that {{fileInfo.loadJob()}} takes a long time to load a job with
more than 40,000 mappers, and during that window {{fileInfo.loadJob()}} is called multiple
times for the same job because there is no synchronization between {{loadedJobCache.get(jobId)}}
and {{loadJob(fileInfo)}}. You will see the used heap memory climb quickly. Looking at a
heap dump, we found 56 {{CompletedJob}} instances for the same job ID, holding more than
2 million mappers in total (56 * 40,000). Based on http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilder.html#build(com.google.common.cache.CacheLoader)
this won't be an issue for {{com.google.common.cache.LoadingCache}}:
"If another thread is currently loading the value for this key, simply waits for that thread
to finish and returns its loaded value."
This looks like a critical issue to me. Should we backport this patch to 2.7.3 and 2.6.5?
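
The single-loader guarantee quoted above can be sketched in plain Java. This is a minimal illustration, not the actual JHS or Guava code: it uses the JDK's {{ConcurrentHashMap.computeIfAbsent}} (which gives the same per-key at-most-once loading that Guava's {{LoadingCache}} documents), and the class and method names are made up for the example. The unsynchronized get-then-load pattern described in the comment would, by contrast, let every concurrent caller miss the cache and load the same job.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleLoaderSketch {
    static final AtomicInteger loads = new AtomicInteger();

    // Stands in for the expensive fileInfo.loadJob() call.
    static String loadJob(String jobId) {
        loads.incrementAndGet();
        try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        return "job-" + jobId;
    }

    // Eight threads race to fetch the same job; returns how many loads ran.
    public static int run() {
        Map<String, String> cache = new ConcurrentHashMap<>();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            // computeIfAbsent runs the loader at most once per key;
            // concurrent callers for that key block until it finishes,
            // mirroring LoadingCache's documented behavior.
            threads[i] = new Thread(
                () -> cache.computeIfAbsent("job_1", SingleLoaderSketch::loadJob));
        }
        for (Thread t : threads) t.start();
        for (Thread t : threads) {
            try { t.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println("loads=" + run()); // prints loads=1
    }
}
```

With the racy get-then-load pattern, the same experiment can report up to eight loads; with a per-key atomic compute, it always reports one.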

> Add capability to set JHS job cache to a task-based limit
> ---------------------------------------------------------
>                 Key: MAPREDUCE-6622
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6622
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobhistoryserver
>    Affects Versions: 2.7.2
>            Reporter: Ray Chiang
>            Assignee: Ray Chiang
>              Labels: supportability
>             Fix For: 2.9.0
>         Attachments: MAPREDUCE-6622.001.patch, MAPREDUCE-6622.002.patch, MAPREDUCE-6622.003.patch,
MAPREDUCE-6622.004.patch, MAPREDUCE-6622.005.patch, MAPREDUCE-6622.006.patch, MAPREDUCE-6622.007.patch,
MAPREDUCE-6622.008.patch, MAPREDUCE-6622.009.patch, MAPREDUCE-6622.010.patch, MAPREDUCE-6622.011.patch,
MAPREDUCE-6622.012.patch, MAPREDUCE-6622.013.patch, MAPREDUCE-6622.014.patch
> When setting the property mapreduce.jobhistory.loadedjobs.cache.size, the cached jobs can
be of varying size. This is generally not a problem when the job sizes are uniform or small,
but when jobs can be very large (say, greater than 250k tasks), the JHS heap size can grow
tremendously.
> In cases where multiple jobs are very large, the JHS can lock up and spend all its time
in GC. However, since the cache is holding on to all the jobs, not much heap space can be
freed.
> Since the total number of tasks loaded is directly proportional to the amount of heap used,
adding a property that caps the number of tasks allowed in the cache should help prevent
the JHS from locking up.
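
The task-based cap described above — evicting by total tasks held rather than by job count — can be sketched with a small weight-tracking LRU map. This is a hedged, JDK-only illustration of the idea, not the patch's actual implementation; the {{TaskWeightedCache}} name and its methods are hypothetical (the real patch would configure a weighted cache inside the JHS).

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: eviction is driven by the total number of tasks
// held, not the number of jobs. Names are hypothetical, not JHS code.
public class TaskWeightedCache {
    private final long maxTotalTasks;
    private long totalTasks = 0;
    // Access-order LinkedHashMap iterates least-recently-used entries first.
    private final LinkedHashMap<String, Integer> taskCounts =
            new LinkedHashMap<>(16, 0.75f, true);

    public TaskWeightedCache(long maxTotalTasks) {
        this.maxTotalTasks = maxTotalTasks;
    }

    public synchronized void put(String jobId, int tasks) {
        Integer old = taskCounts.put(jobId, tasks);
        totalTasks += tasks - (old == null ? 0 : old);
        // Evict least-recently-used jobs until the total task count
        // fits under the cap; the entry just inserted is kept.
        Iterator<Map.Entry<String, Integer>> it =
                taskCounts.entrySet().iterator();
        while (totalTasks > maxTotalTasks && it.hasNext()) {
            Map.Entry<String, Integer> lru = it.next();
            if (lru.getKey().equals(jobId)) continue;
            totalTasks -= lru.getValue();
            it.remove();
        }
    }

    public synchronized long totalTasks() { return totalTasks; }
    public synchronized int size() { return taskCounts.size(); }
}
```

For example, with a cap of 100 tasks, inserting three jobs of 40 tasks each evicts the least-recently-used one, leaving two jobs and 80 tasks — a 250k-task job counts for what it actually costs in heap, instead of counting as one cache slot.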

This message was sent by Atlassian JIRA
