hadoop-mapreduce-issues mailing list archives

From "Haibo Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6684) High contention on scanning of user directory under immediate_done in Job History Server
Date Fri, 22 Apr 2016 17:51:13 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254314#comment-15254314 ]

Haibo Chen commented on MAPREDUCE-6684:
---------------------------------------

Thanks a lot for your insight on why the intermediate directory is scanned before the done directory
and on the potential name node issue, [~revans2]. That makes a lot of sense. Per offline discussion
with [~karthik.jayaprakasham@ldc.lu.se], we'd like to propose three approaches.

1. For web API requests for individual jobs, the intermediate directory is still scanned first,
but inside scanIntermediateDir() we could check for the existence of the jhst files of
the associated job (), and move files from the intermediate directory to the done directory
only when those files exist. The assumption is that a file existence check is not expensive, and
if the files do not exist in the intermediate directory, we hold the lock on the user directory
only for a short period of time.

2. For web API requests for individual jobs, when the intermediate directory is scanned, check
for the existence of the job files, and move only the files of the job associated with the request
from the intermediate directory to the done directory. This reduces the time for which each
job web request thread blocks, but may give much lower overall throughput than the previous
approach, where files are moved in batch.

3. Have a dedicated thread scan the intermediate directory while other threads wait on
a monitor associated with a particular job. When the dedicated thread finishes, the threads waiting
on the monitors are notified. With a single writer, contention on the user directory
lock can be reduced. But it does have the problem of conflicting with clients' expectations,
as [~revans2] pointed out in a previous comment.
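
To make approach 1 concrete, here is a minimal sketch, not a patch. It assumes a single lock per user directory and uses java.nio.file on a local file system instead of Hadoop's FileSystem API. The class IntermediateScanSketch, its methods, and the prefix match on the job ID are all hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of approach 1; this is not the real HistoryFileManager.
final class IntermediateScanSketch {
    // Stands in for the lock on one user directory.
    private final ReentrantLock userDirLock = new ReentrantLock();
    private final Path intermediateUserDir;
    private final Path doneDir;

    IntermediateScanSketch(Path intermediateUserDir, Path doneDir) {
        this.intermediateUserDir = intermediateUserDir;
        this.doneDir = doneDir;
    }

    /** Returns true if a batch move happened, false if the job had no files here. */
    boolean scanFor(String jobId) throws IOException {
        userDirLock.lock();
        try {
            // Cheap check first: if the requested job has nothing in the
            // intermediate directory, release the lock almost immediately.
            if (!hasFilesFor(jobId)) {
                return false;
            }
            // Otherwise move everything in batch, as the scan does today.
            moveAllToDone();
            return true;
        } finally {
            userDirLock.unlock();
        }
    }

    // Assumption: files of a job start with its job ID (simplified prefix match).
    private boolean hasFilesFor(String jobId) throws IOException {
        try (DirectoryStream<Path> matches =
                 Files.newDirectoryStream(intermediateUserDir, jobId + "*")) {
            return matches.iterator().hasNext();
        }
    }

    private void moveAllToDone() throws IOException {
        // Collect first so that files are not moved while the listing is open.
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> all = Files.newDirectoryStream(intermediateUserDir)) {
            for (Path p : all) {
                files.add(p);
            }
        }
        for (Path p : files) {
            Files.move(p, doneDir.resolve(p.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Approach 2 would differ only in the move step: moveAllToDone() would move just the files matching the requested job ID instead of the whole directory.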

Can you please share some of your thoughts on them, [~revans2], [~jlowe]?

> High contention on scanning of user directory under immediate_done in Job History Server
> ----------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6684
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6684
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>    Affects Versions: 2.7.0
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>            Priority: Critical
>         Attachments: jhs-jstacks-service-monitor-running.tar.gz, jhs-jstacks-service-monitor-stopped.tar.gz
>
>
> HistoryFileManager.scanIntermediateDirectory() in JHS acquires a lock on each user directory
> it tries to scan (moving or deleting files under the user directory as necessary). This method
> is called in a thread in JobHistory that performs periodic scanning of the intermediate directory,
> and can also be called by web server threads for each Web API call made by a JHS client. In
> cases where there are many concurrent Web API calls/connections to JHS, all but one thread
> are blocked on the lock on the user directory. Eventually, client connections will time out,
> but the threads in JHS will not be killed, leaving a lot of TCP connections in CLOSE_WAIT
> state.
> {noformat}
> [systest@vb1120 ~]$ sudo netstat -nap | grep 63729 | sort -k 4
> tcp        0      0 10.17.202.19:10020          0.0.0.0:*                   LISTEN      63729/java
> tcp        0      0 10.17.202.19:10020          10.17.198.30:33010          ESTABLISHED 63729/java
> tcp        0      0 10.17.202.19:10020          10.17.200.30:33980          ESTABLISHED 63729/java
> tcp        0      0 10.17.202.19:10020          10.17.202.10:59625          ESTABLISHED 63729/java
> tcp        0      0 10.17.202.19:10020          10.17.202.13:35765          ESTABLISHED 63729/java
> tcp        0      0 10.17.202.19:10033          0.0.0.0:*                   LISTEN      63729/java
> tcp        0      0 10.17.202.19:19888          0.0.0.0:*                   LISTEN      63729/java
> tcp        0      0 10.17.202.19:19888          10.17.198.30:35103          ESTABLISHED 63729/java
> tcp      277      0 10.17.202.19:19888          10.17.198.30:43670          ESTABLISHED 63729/java
> tcp        0      0 10.17.202.19:19888          10.17.198.30:45453          ESTABLISHED 63729/java
> tcp      277      0 10.17.202.19:19888          10.17.198.30:49184          ESTABLISHED 63729/java
> tcp        1      0 10.17.202.19:19888          10.17.202.13:49992          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52703          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52707          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52708          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52710          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52714          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52723          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52726          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52727          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52739          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52749          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52753          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52757          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52760          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52820          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52827          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52829          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52831          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52833          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52836          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52839          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52841          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52843          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52850          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52860          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52876          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52879          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52881          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52884          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52886          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52888          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52891          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52893          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52896          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52898          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52899          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52902          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52909          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52910          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52912          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52923          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52925          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52927          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52930          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52937          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52939          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52945          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52947          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52969          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52972          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52975          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53004          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53007          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53009          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53011          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53052          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53058          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53059          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53063          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53071          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53084          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53093          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53095          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53097          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53101          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53104          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53106          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53108          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53110          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53112          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53114          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53115          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53117          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53121          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53123          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53125          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53127          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53129          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53131          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53134          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53138          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53140          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53153          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53155          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53157          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53159          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53173          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53176          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53177          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53178          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53179          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53181          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53183          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53201          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53204          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53218          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53267          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53270          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53275          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53278          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53280          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53283          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53293          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53296          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53299          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53309          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53312          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53314          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53317          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53320          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53322          CLOSE_WAIT  63729/java
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53338          CLOSE_WAIT  63729/java
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53340          CLOSE_WAIT  63729/java
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53364          ESTABLISHED 63729/java
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53366          ESTABLISHED 63729/java
> tcp      260      0 10.17.202.19:19888          10.17.202.13:53367          ESTABLISHED 63729/java
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53380          ESTABLISHED 63729/java
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53382          ESTABLISHED 63729/java
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53386          ESTABLISHED 63729/java
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53390          ESTABLISHED 63729/java
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53392          ESTABLISHED 63729/java
> tcp     1278      0 10.17.202.19:19888          10.17.202.18:45301          CLOSE_WAIT  63729/java
> tcp     1278      0 10.17.202.19:19888          10.17.202.18:45303          CLOSE_WAIT  63729/java
> tcp     1277      0 10.17.202.19:19888          10.17.202.18:45306          ESTABLISHED 63729/java
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
