hadoop-mapreduce-issues mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-6436) JobHistory cache issue
Date Tue, 15 Dec 2015 08:39:46 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated MAPREDUCE-6436:
---------------------------------
    Target Version/s: 2.7.3, 2.6.4
            Priority: Blocker  (was: Major)
         Description: 
Problem: 
HistoryFileManager.addIfAbsent produces a large amount of log output when the
number of cached entries younger than mapreduce.jobhistory.max-age-ms far
exceeds mapreduce.jobhistory.joblist.cache.size.

Example:
If the cache contains 50,000 entries in total, 10,000 of which are newer than
mapreduce.jobhistory.max-age-ms, and mapreduce.jobhistory.joblist.cache.size
is 20,000, the HistoryFileManager.addIfAbsent method writes
50,000 - 20,000 = 30,000 lines of the message "Waiting to remove <key> from
JobListCache because it is not in done yet".

I will attach a stack trace.
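
The arithmetic in the example can be illustrated with a minimal sketch. It is a simplified model, not the actual HistoryFileManager code: it assumes one warning per entry above the cache limit that cannot be evicted yet. The class name EvictionLogCount, the method warningsForOneAdd, and the job_<n> keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class EvictionLogCount {

    /**
     * Hypothetical model of a single add: every entry above the cache limit
     * that cannot be evicted yet costs one warning line.
     */
    public static List<String> warningsForOneAdd(int totalEntries, int maxCacheSize) {
        List<String> warnings = new ArrayList<>();
        int excess = Math.max(0, totalEntries - maxCacheSize);
        for (int i = 0; i < excess; i++) {
            // Message text taken from the issue; the key format is made up.
            warnings.add("Waiting to remove job_" + i
                + " from JobListCache because it is not in done yet");
        }
        return warnings;
    }

    public static void main(String[] args) {
        // Numbers from the example: 50,000 cached entries, limit of 20,000.
        List<String> warnings = warningsForOneAdd(50_000, 20_000);
        System.out.println(warnings.size() + " warning lines for a single add");
    }
}
```

Under this model every later add repeats the same 30,000 lines for as long as the entries stay un-evictable, which is how the log volume grows.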

Impact:
In addition to consuming a large amount of disk space, this issue blocks
JobHistory.getJob for a long time and slows job execution significantly,
because getJob is called by RPCs such as
HistoryClientService.HSClientProtocolHandler.getJobReport.
This happens because HistoryFileManager.UserLogDir.scanIfNeeded eventually
calls HistoryFileManager.addIfAbsent in a synchronized block. When multiple
threads call scanIfNeeded simultaneously, one of them acquires the lock and
the others are blocked until the first thread completes its long-running
HistoryFileManager.addIfAbsent call.
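
The blocking described above can be reproduced with a small stand-alone sketch. It does not use Hadoop classes: ScanContention and its scanIfNeeded method are hypothetical stand-ins, and the 200 ms sleep stands in for the long-running add loop that runs while the monitor is held.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

public class ScanContention {
    private final List<String> events = new CopyOnWriteArrayList<>();

    // Stand-in for a scan that does slow work while holding the object's monitor.
    synchronized void scanIfNeeded(String caller, CountDownLatch entered, long workMillis) {
        events.add(caller + " enter");
        entered.countDown();              // signal that the lock is now held
        try {
            Thread.sleep(workMillis);     // sleeping does not release the monitor
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        events.add(caller + " exit");
    }

    /** Runs two callers and returns the order in which they entered and left the scan. */
    public static List<String> run() {
        ScanContention dir = new ScanContention();
        CountDownLatch firstEntered = new CountDownLatch(1);
        Thread first = new Thread(() -> dir.scanIfNeeded("first", firstEntered, 200));
        Thread second = new Thread(() -> dir.scanIfNeeded("second", new CountDownLatch(1), 0));
        first.start();
        try {
            firstEntered.await();         // start the second caller only once the lock is taken
            second.start();
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return dir.events;
    }

    public static void main(String[] args) {
        // prints [first enter, first exit, second enter, second exit]
        System.out.println(run());
    }
}
```

The second caller has no work of its own, yet it cannot enter until the first caller leaves. An RPC handler thread waiting on the scan is in the same position.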

Solution: 
* Reduce the amount of logging so that HistoryFileManager.addIfAbsent does not take too long.
* Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips
  scanning if another thread is already scanning. This changes the semantics of
  some HistoryFileManager methods (such as getAllFileInfo and getFileInfo)
  because scanIfNeeded would then keep outdated state.
* Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls are
  not blocked by a loop over tens of thousands of entries.
 
This patch implements the first item.
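
As a rough illustration of the first two items, here is a minimal sketch. It is not the attached patch and does not use Hadoop classes: QuieterScan, summarize, and scanIfIdle are hypothetical names, and the summary wording is made up. The first method collapses the per-key warnings into one summary line; the second skips a scan when one is already in progress.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

public class QuieterScan {

    /** Item 1 (sketch): one summary line instead of one warning per pending key. */
    public static Optional<String> summarize(List<String> pendingKeys) {
        if (pendingKeys.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of("Waiting to remove " + pendingKeys.size()
            + " entries (e.g. " + pendingKeys.get(0)
            + ") from JobListCache because they are not in done yet");
    }

    private final AtomicBoolean scanning = new AtomicBoolean(false);

    /**
     * Item 2 (sketch): run the scan only if no scan is in progress.
     * Returns false when the call was skipped; the caller then works
     * with whatever state the previous scan left behind.
     */
    public boolean scanIfIdle(Runnable scan) {
        if (!scanning.compareAndSet(false, true)) {
            return false;                 // a scan is already running, do not wait
        }
        try {
            scan.run();
            return true;
        } finally {
            scanning.set(false);
        }
    }

    public static void main(String[] args) {
        summarize(List.of("job_1", "job_2", "job_3")).ifPresent(System.out::println);
        QuieterScan dir = new QuieterScan();
        // A call made while a scan is running is skipped instead of blocked.
        dir.scanIfIdle(() -> System.out.println("nested call ran: " + dir.scanIfIdle(() -> { })));
    }
}
```

The skip-instead-of-wait choice is what changes the semantics mentioned in the second item: a skipped caller returns immediately but may read state that is older than the scan still in progress.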



> JobHistory cache issue
> ----------------------
>
>                 Key: MAPREDUCE-6436
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6436
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Ryu Kobayashi
>            Assignee: Kai Sasaki
>            Priority: Blocker
>         Attachments: MAPREDUCE-6436.1.patch, MAPREDUCE-6436.2.patch, MAPREDUCE-6436.3.patch,
>                      MAPREDUCE-6436.4.patch, stacktrace1.txt, stacktrace2.txt, stacktrace3.txt
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
