Date: Tue, 10 Nov 2015 08:22:11 +0000 (UTC)
From: "Hudson (JIRA)"
To: common-issues@hadoop.apache.org
Reply-To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-12482) Race condition in JMX cache update
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998238#comment-14998238 ]

Hudson commented on HADOOP-12482:
---------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #1385 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1385/])
HADOOP-12482. Race condition in JMX cache update.
(Tony Wu via lei) (lei: rev 0eb9c60c5bec79f531da8cb3226d7e8b1d7e6639)
* hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/metrics2/impl/TestMetricsSourceAdapter.java
* hadoop-common-project/hadoop-common/CHANGES.txt
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/metrics2/impl/MetricsSourceAdapter.java


> Race condition in JMX cache update
> ----------------------------------
>
>                 Key: HADOOP-12482
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12482
>             Project: Hadoop Common
>          Issue Type: Bug
>   Affects Versions: 2.7.1
>            Reporter: Tony Wu
>            Assignee: Tony Wu
>             Fix For: 2.8.0, 3.0.0
>
>         Attachments: HADOOP-12482.001.patch, HADOOP-12482.002.patch, HADOOP-12482.003.patch, HADOOP-12482.004.patch, HADOOP-12482.005.patch, HADOOP-12482.006.patch
>
>
> updateJmxCache() was updated in HADOOP-11301. However, the patch introduced a race condition. In the updateJmxCache() function in MetricsSourceAdapter.java:
> {code:java}
>   private void updateJmxCache() {
>     boolean getAllMetrics = false;
>     synchronized (this) {
>       if (Time.now() - jmxCacheTS >= jmxCacheTTL) {
>         // temporarily advance the expiry while updating the cache
>         jmxCacheTS = Time.now() + jmxCacheTTL;
>         if (lastRecs == null) {
>           getAllMetrics = true;
>         }
>       } else {
>         return;
>       }
>       if (getAllMetrics) {
>         MetricsCollectorImpl builder = new MetricsCollectorImpl();
>         getMetrics(builder, true);
>       }
>       updateAttrCache();
>       if (getAllMetrics) {
>         updateInfoCache();
>       }
>       jmxCacheTS = Time.now();
>       lastRecs = null; // in case regular interval update is not running
>     }
>   }
> {code}
> Notice that getAllMetrics is set to true only when both of the following hold:
> # jmxCacheTTL has passed
> # lastRecs == null
> lastRecs is set to null in the same function, but gets reassigned by getMetrics().
> However, getMetrics() can also be called from a different thread, via:
> # MetricsSystemImpl.onTimerEvent()
> # MetricsSystemImpl.publishMetricsNow()
> Consider the following sequence:
> # updateJmxCache() is called by getMBeanInfo() from a thread getting cached info.
> ** lastRecs is set to null.
> # The metrics source is updated with a new value/field.
> # getMetrics() is called by publishMetricsNow() or onTimerEvent() from a different thread getting the latest metrics.
> ** lastRecs is updated (!= null).
> # jmxCacheTTL passes.
> # updateJmxCache() is called again via getMBeanInfo().
> ** However, because lastRecs is already updated (!= null), getAllMetrics will not be set to true. So updateInfoCache() is not called and getMBeanInfo() returns the old cached info.
> We ran into this issue on a cluster where a new metric did not get published until much later.
> The case can be made worse by a periodic call to getMetrics() (driven by an external program or script). In that case getMBeanInfo() may never be able to retrieve the new record.
> The desired behavior is that updateJmxCache() is guaranteed to call updateInfoCache() once after jmxCacheTTL, if lastRecs has been set to null by updateJmxCache() itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
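
For readers of the archive, the sequence in the quoted description can be replayed deterministically with a small model. This is only a sketch under stated assumptions, not the Hadoop code and not necessarily the committed patch: the Adapter class, the fake clock (now), the list-based source and infoCache, and the lastRecsCleared flag are all hypothetical names introduced for illustration. The other thread's getMetrics() call is replayed inline in the order the description gives, so no real threads are needed.

```java
import java.util.ArrayList;
import java.util.List;

class JmxCacheRaceSketch {

  /** Simplified stand-in for MetricsSourceAdapter (hypothetical, not the real class). */
  static class Adapter {
    final boolean useClearedFlag;      // false: check lastRecs == null; true: check the flag
    final long ttl = 10;               // stands in for jmxCacheTTL
    long now = 100;                    // fake clock replacing Time.now()
    long cacheTs = 0;                  // stands in for jmxCacheTS
    final List<String> source = new ArrayList<>();  // metric names the source exposes
    List<String> lastRecs = null;
    boolean lastRecsCleared = true;    // hypothetical flag: "updateJmxCache() nulled lastRecs"
    List<String> infoCache = new ArrayList<>();

    Adapter(boolean useClearedFlag) {
      this.useClearedFlag = useClearedFlag;
    }

    /** Snapshot of the source; in the description this may run on a timer thread. */
    synchronized List<String> getMetrics() {
      lastRecs = new ArrayList<>(source);
      return lastRecs;
    }

    synchronized void updateJmxCache() {
      if (now - cacheTs < ttl) {
        return;                        // cache is still fresh
      }
      boolean getAllMetrics = useClearedFlag ? lastRecsCleared : (lastRecs == null);
      if (getAllMetrics) {
        lastRecsCleared = false;       // consume the flag
        getMetrics();
        infoCache = new ArrayList<>(lastRecs);  // plays the role of updateInfoCache()
      }
      cacheTs = now;
      lastRecs = null;                 // in case regular interval update is not running
      lastRecsCleared = true;          // remember that this method cleared lastRecs
    }

    List<String> getMBeanInfo() {
      updateJmxCache();
      return infoCache;
    }
  }

  /** Replays the sequence from the description; true if the new metric becomes visible. */
  static boolean newMetricVisible(boolean useClearedFlag) {
    Adapter a = new Adapter(useClearedFlag);
    a.source.add("oldMetric");
    a.getMBeanInfo();                  // step 1: cache filled, lastRecs set to null
    a.source.add("newMetric");         // step 2: source gains a new field
    a.getMetrics();                    // step 3: other caller sets lastRecs != null
    a.now += a.ttl;                    // step 4: jmxCacheTTL passes
    return a.getMBeanInfo().contains("newMetric");  // step 5
  }

  public static void main(String[] args) {
    System.out.println("lastRecs == null check sees new metric: " + newMetricVisible(false));
    System.out.println("cleared-flag check sees new metric:     " + newMetricVisible(true));
  }
}
```

In this model the first line prints false and the second prints true. The flag variant refreshes the info cache on every TTL expiry that follows a clear by updateJmxCache() itself, which is the behavior the description asks for. Whether the real adapter should track this with a flag or by some other means is a design choice this sketch does not settle.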