hadoop-common-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses
Date Tue, 04 Jun 2013 23:44:21 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675436#comment-13675436 ]

Colin Patrick McCabe commented on HADOOP-9618:
----------------------------------------------

I kind of wish we could use the JVM's {{-Xloggc:logfile}} option to get this information, since
in theory the GC log should be more trustworthy than trying to guess.  Is that too much hassle
to configure by default?

I suppose the thread method detects machine pauses which are *not* the result of GCs, so you
could say that it gives more information (although perhaps more questionable information).
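
One way to make the sleep-based reading less of a guess would be to compare an observed pause against the GC time the JVM itself reports over the same window. The following is only a rough sketch of that idea, not anything from the attached patch: the class name, the {{classify}} helper, and the "at least half the pause" cutoff are all hypothetical and chosen for illustration. The only real API used is the standard {{java.lang.management}} GC beans.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the attached patch.
class GcTimeSnapshot {

    /** Total accumulated GC time in ms across all collectors the JVM exposes. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /**
     * Crude guess at the cause of a pause. The one-half cutoff is an
     * arbitrary assumption made for this sketch.
     */
    static String classify(long pauseMillis, long gcMillisDuringPause) {
        return gcMillisDuringPause * 2 >= pauseMillis ? "likely GC" : "likely not GC";
    }

    public static void main(String[] args) {
        long before = totalGcMillis();
        // ... the watchdog's sleep would happen here ...
        long after = totalGcMillis();
        System.out.println("GC time during window: " + (after - before) + " ms");
    }
}
```

A watchdog could take one snapshot before the sleep and one after, then log the difference next to the observed pause so a reader of the log can judge the cause.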

I'm a little gun-shy of the 1 second timeout.  It wasn't too long ago that the Linux scheduler
quantum was 100 milliseconds.  So if you had ten threads hogging a CPU, each using its full
quantum, a whole second could pass before your watchdog thread got to run.  I think the timeout
either needs to be longer, or the thread needs to be a high-priority thread, possibly even
realtime priority.
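
To make the two knobs concrete, here is a minimal sketch of a sleep-loop watchdog where the sleep interval and the warning threshold are separate parameters and the thread asks for maximum Java priority. This is an illustration written for this discussion, not the code in hadoop-9618.txt; the class and method names are hypothetical. Note that {{Thread.setPriority}} is only a hint, and some JVM/OS combinations may ignore it, so it is not a substitute for realtime scheduling.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the attached patch.
class PauseWatchdog implements Runnable {
    private final long sleepMillis;
    private final long warnThresholdMillis;
    private final AtomicLong pausesDetected = new AtomicLong();
    private volatile boolean running = true;

    PauseWatchdog(long sleepMillis, long warnThresholdMillis) {
        this.sleepMillis = sleepMillis;
        this.warnThresholdMillis = warnThresholdMillis;
    }

    /** Time spent beyond the requested sleep, in ms; never negative. */
    static long extraMillis(long startNanos, long endNanos, long sleepMillis) {
        long elapsedMillis = (endNanos - startNanos) / 1_000_000L;
        return Math.max(0, elapsedMillis - sleepMillis);
    }

    @Override
    public void run() {
        while (running) {
            long start = System.nanoTime(); // monotonic, unaffected by wall-clock changes
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long extra = extraMillis(start, System.nanoTime(), sleepMillis);
            if (extra > warnThresholdMillis) {
                pausesDetected.incrementAndGet();
                System.err.println("WARN: detected pause of approximately "
                        + extra + " ms (JVM or host)");
            }
        }
    }

    void stop() {
        running = false;
    }

    long pausesDetected() {
        return pausesDetected.get();
    }

    /** Starts the watchdog as a daemon thread and requests maximum priority. */
    static Thread start(PauseWatchdog watchdog) {
        Thread t = new Thread(watchdog, "pause-watchdog");
        t.setDaemon(true);
        t.setPriority(Thread.MAX_PRIORITY); // a hint only; may be ignored
        t.start();
        return t;
    }
}
```

With something like {{new PauseWatchdog(1000, 5000)}}, the loop still wakes every second but only warns when the sleep overshoots by more than five seconds, which leaves headroom for ordinary scheduling delay on a loaded machine.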

Have you tried running this with a gnarly MapReduce job going on?
                
> Add thread which detects JVM pauses
> -----------------------------------
>
>                 Key: HADOOP-9618
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9618
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: util
>    Affects Versions: 3.0.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hadoop-9618.txt
>
>
> Users often struggle to understand what happened when a long JVM pause (GC or otherwise)
causes things to malfunction inside a Hadoop daemon. For example, a long GC pause while logging
an edit to the QJM may cause the edit to time out, or a long GC pause may cause other IPCs to
the NameNode to time out. We should add a simple thread which loops on 1-second sleeps, and if
a sleep ever takes significantly longer than 1 second, logs a WARN. This will make GC pauses
obvious in logs.

