Date: Wed, 5 Jun 2013 00:34:20 +0000 (UTC)
From: "Todd Lipcon (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-9618) Add thread which detects JVM pauses

    [ https://issues.apache.org/jira/browse/HADOOP-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675471#comment-13675471 ]

Todd Lipcon commented on HADOOP-9618:
-------------------------------------

bq. I kind of wish we could use the JVM's -Xloggc:logfile to get this information, since theoretically it should be more trustworthy than trying to guess. Is that too much hassle to configure by default?
The problem is that the GC logs don't roll, and it's difficult to correlate them with the log4j stream, since the timestamps in the GC logs are in a different format than log4j's, etc. -- plus they won't roll up through alternate log4j appenders to centralized monitoring.

bq. I suppose the thread method detects machine pauses which are not the result of GCs, so you could say that it gives more information (although perhaps more questionable information).

Yep -- I've seen cases where the kernel locks up for multiple seconds due to some bug, and that's interesting. There are also JVM "safepoint pauses", which are nasty and don't show up in the GC logs unless you use -XX:+PrintSafepointStatistics, which is super verbose.

bq. I'm a little gun-shy of the 1 second timeout. It wasn't too long ago that the Linux scheduler quantum was 100 milliseconds. So if you had ten threads hogging the CPU, you'd already have no time left to run your watchdog thread. I think the timeout either needs to be longer, or the thread needs to be a high-priority thread, possibly even realtime priority.

If one of your important Hadoop daemons is so overloaded, I think that would be interesting as well. This only logs if the 1-second sleep takes 3 seconds, so things like scheduling jitter won't cause log messages unless the "jitter" is multiple seconds long. At that point, I'd want to know about it regardless of whether it's GC, a kernel issue, contention for machine resources, swap, etc. Do you disagree?

> Add thread which detects JVM pauses
> -----------------------------------
>
>                 Key: HADOOP-9618
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9618
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: util
>    Affects Versions: 3.0.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hadoop-9618.txt
>
>
> Often times users struggle to understand what happened when a long JVM pause (GC or otherwise) causes things to malfunction inside a Hadoop daemon.
> For example, a long GC pause while logging an edit to the QJM may cause the edit to time out, or a long GC pause may make other IPCs to the NameNode time out. We should add a simple thread which loops on 1-second sleeps and, if the sleep ever takes significantly longer than 1 second, logs a WARN. This will make GC pauses obvious in logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
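The sleep-loop watchdog described in the issue can be sketched roughly as follows. This is an illustration of the technique, not the attached hadoop-9618.txt patch: the class name, thresholds, and stderr logging are all placeholders (a real version would log through log4j so pauses land next to the daemon's other log output):

```java
/**
 * Sketch of a JVM pause detector: a daemon thread sleeps for a fixed
 * interval and, if the sleep overshoots by more than a threshold, warns
 * that the JVM (or the whole machine) was paused -- by GC, a safepoint,
 * a kernel hiccup, swapping, etc.
 *
 * Hypothetical class; names and thresholds are illustrative only.
 */
public class PauseMonitorSketch {
    static final long SLEEP_MS = 1000;           // nominal sleep interval
    static final long WARN_THRESHOLD_MS = 3000;  // only warn on multi-second overshoot

    /** Time spent beyond the requested sleep, in milliseconds. */
    static long overshootMs(long beforeNanos, long afterNanos, long sleepMs) {
        return (afterNanos - beforeNanos) / 1_000_000L - sleepMs;
    }

    /** Start the watchdog as a daemon thread and return it. */
    public static Thread start() {
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                long before = System.nanoTime();
                try {
                    Thread.sleep(SLEEP_MS);
                } catch (InterruptedException e) {
                    return; // shutting down
                }
                long extra = overshootMs(before, System.nanoTime(), SLEEP_MS);
                if (extra > WARN_THRESHOLD_MS) {
                    // Placeholder for a log4j WARN in a real daemon.
                    System.err.println("WARN Detected pause in JVM or host machine "
                            + "(e.g. GC): approximately " + extra + "ms");
                }
            }
        }, "jvm-pause-monitor");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

Because the thread only compares wall-clock time around its own sleep, it catches any source of stall, not just GC -- which is exactly the point made above about kernel lockups and safepoint pauses.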