hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6684) High contention on scanning of user directory under immediate_done in Job History Server
Date Tue, 26 Apr 2016 14:42:13 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258190#comment-15258190
] 

Jason Lowe commented on MAPREDUCE-6684:
---------------------------------------

We need to get to the bottom of why the lock is being held so long. The move itself is done
via an executor pool so that shouldn't be slow.

According to the jstacks, every single one has this stacktrace for the thread that has the
lock on the user's intermediate directory:
{noformat}
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo.didMoveFail(HistoryFileManager.java:356)
        - waiting to lock <0x000000056106dd80> (a org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:873)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:315)
        - locked <0x00000004fae96ee0> (a org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:839)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getFileInfo(HistoryFileManager.java:978)
        at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.loadJob(CachedHistoryStorage.java:174)
[...]
{noformat}

So the thread has the lock, but it's waiting a long time on some other thread which isn't
ideal.  Every jstack shows the other thread doing this:
{noformat}
"473394034@qtp-1617102129-245" daemon prio=10 tid=0x00007f66c8185000 nid=0x6437 runnable [0x00007f6693b6c000]
   java.lang.Thread.State: RUNNABLE
      at java.util.HashMap.put(HashMap.java:494)
      at org.apache.avro.util.WeakIdentityHashMap.put(WeakIdentityHashMap.java:108)
      at org.apache.avro.LogicalTypes.fromSchemaIgnoreInvalid(LogicalTypes.java:62)
      at org.apache.avro.Schema.parse(Schema.java:1318)
      at org.apache.avro.Schema.parse(Schema.java:1260)
      at org.apache.avro.Schema.parse(Schema.java:1331)
      at org.apache.avro.Schema.parse(Schema.java:1260)
      at org.apache.avro.Schema$Parser.parse(Schema.java:1024)
      at org.apache.avro.Schema$Parser.parse(Schema.java:1012)
      at org.apache.avro.Schema.parse(Schema.java:1064)
      at org.apache.hadoop.mapreduce.jobhistory.EventReader.<init>(EventReader.java:71)
      at org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.parse(JobHistoryParser.java:139)
      - locked <0x00000005a05fd4e8> (a org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser)
      at org.apache.hadoop.mapreduce.v2.hs.CompletedJob.loadFullHistoryData(CompletedJob.java:339)
      - locked <0x0000000577bd9b98> (a org.apache.hadoop.mapreduce.v2.hs.CompletedJob)
      at org.apache.hadoop.mapreduce.v2.hs.CompletedJob.<init>(CompletedJob.java:101)
      at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo.loadJob(HistoryFileManager.java:451)
      - locked <0x000000056106dd80> (a org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo)
[...]
{noformat}

So I don't think any of the above three proposals will fix the issue completely.  The core
issue is the scan being held up for an arbitrarily long jhist parse.  I suspect things will
get a _lot_ better if we can decouple these two processes.  I haven't gone through all the
race conditions, but I'm wondering if we can simply make the state field volatile and allow
the didMoveFail and isMovePending methods to avoid grabbing the lock.

> High contention on scanning of user directory under immediate_done in Job History Server
> ----------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6684
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6684
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>    Affects Versions: 2.7.0
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>            Priority: Critical
>         Attachments: jhs-jstacks-service-monitor-running.tar.gz, jhs-jstacks-service-monitor-stopped.tar.gz
>
>
> HistoryFileManager.scanIntermediateDirectory() in JHS acquires a lock on each user directory
it tries to scan (move or delete files under the user directory as necessary). This method
is called in a thread in JobHistory that performs periodical scanning of intermediate directory,
and can also be called by web server threads for each Web API call made by a JHS client. In
cases where there are many concurrent Web API calls/connections to JHS, all but one thread
are blocked on the lock on the user directory. Eventually, client connects will time out,
but the threads in JHS will not be killed and leave a lot of TCP connections in CLOSE_WAIT
state. 
> {noformat}
> [systest@vb1120 ~]$ sudo netstat -nap | grep 63729 | sort -k 4
> tcp        0      0 10.17.202.19:10020          0.0.0.0:*                   LISTEN  
   63729/java          
> tcp        0      0 10.17.202.19:10020          10.17.198.30:33010          ESTABLISHED
63729/java          
> tcp        0      0 10.17.202.19:10020          10.17.200.30:33980          ESTABLISHED
63729/java          
> tcp        0      0 10.17.202.19:10020          10.17.202.10:59625          ESTABLISHED
63729/java          
> tcp        0      0 10.17.202.19:10020          10.17.202.13:35765          ESTABLISHED
63729/java          
> tcp        0      0 10.17.202.19:10033          0.0.0.0:*                   LISTEN  
   63729/java          
> tcp        0      0 10.17.202.19:19888          0.0.0.0:*                   LISTEN  
   63729/java          
> tcp        0      0 10.17.202.19:19888          10.17.198.30:35103          ESTABLISHED
63729/java          
> tcp      277      0 10.17.202.19:19888          10.17.198.30:43670          ESTABLISHED
63729/java          
> tcp        0      0 10.17.202.19:19888          10.17.198.30:45453          ESTABLISHED
63729/java          
> tcp      277      0 10.17.202.19:19888          10.17.198.30:49184          ESTABLISHED
63729/java          
> tcp        1      0 10.17.202.19:19888          10.17.202.13:49992          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52703          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52707          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52708          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52710          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52714          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52723          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52726          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52727          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52739          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52749          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52753          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52757          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52760          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52820          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52827          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52829          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52831          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52833          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52836          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52839          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52841          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52843          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52850          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52860          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52876          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52879          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52881          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52884          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52886          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52888          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52891          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52893          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52896          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52898          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52899          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52902          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52909          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52910          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52912          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52923          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52925          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52927          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52930          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52937          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52939          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52945          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52947          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52969          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52972          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52975          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53004          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53007          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53009          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53011          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53052          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53058          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53059          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53063          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53071          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53084          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53093          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53095          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53097          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53101          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53104          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53106          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53108          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53110          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53112          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53114          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53115          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53117          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53121          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53123          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53125          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53127          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53129          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53131          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53134          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53138          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53140          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53153          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53155          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53157          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53159          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53173          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53176          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53177          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53178          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53179          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53181          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53183          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53201          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53204          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53218          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53267          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53270          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53275          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53278          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53280          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53283          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53293          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53296          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53299          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53309          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53312          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53314          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53317          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53320          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53322          CLOSE_WAIT
 63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53338          CLOSE_WAIT
 63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53340          CLOSE_WAIT
 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53364          ESTABLISHED
63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53366          ESTABLISHED
63729/java          
> tcp      260      0 10.17.202.19:19888          10.17.202.13:53367          ESTABLISHED
63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53380          ESTABLISHED
63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53382          ESTABLISHED
63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53386          ESTABLISHED
63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53390          ESTABLISHED
63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53392          ESTABLISHED
63729/java          
> tcp     1278      0 10.17.202.19:19888          10.17.202.18:45301          CLOSE_WAIT
 63729/java          
> tcp     1278      0 10.17.202.19:19888          10.17.202.18:45303          CLOSE_WAIT
 63729/java          
> tcp     1277      0 10.17.202.19:19888          10.17.202.18:45306          ESTABLISHED
63729/java 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message