Date: Tue, 27 Oct 2015 16:04:27 +0000 (UTC)
From: "Tony Wu (JIRA)"
To: common-issues@hadoop.apache.org
Reply-To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-12482) Race condition in JMX cache update

    [ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976619#comment-14976619 ]

Tony Wu commented on HADOOP-12482:
----------------------------------

Manually ran the failed tests on Linux using JDK 1.7; all tests pass without error.

> Race condition in JMX cache update
> ----------------------------------
>
>                 Key: HADOOP-12482
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12482
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Tony Wu
>            Assignee: Tony Wu
>         Attachments: HADOOP-12482.001.patch
>
>
> updateJmxCache() was updated in HADOOP-11301. However, the patch introduced a race condition. In the updateJmxCache() function in MetricsSourceAdapter.java:
> {code:java}
>   private void updateJmxCache() {
>     boolean getAllMetrics = false;
>     synchronized (this) {
>       if (Time.now() - jmxCacheTS >= jmxCacheTTL) {
>         // temporarily advance the expiry while updating the cache
>         jmxCacheTS = Time.now() + jmxCacheTTL;
>         if (lastRecs == null) {
>           getAllMetrics = true;
>         }
>       } else {
>         return;
>       }
>       if (getAllMetrics) {
>         MetricsCollectorImpl builder = new MetricsCollectorImpl();
>         getMetrics(builder, true);
>       }
>       updateAttrCache();
>       if (getAllMetrics) {
>         updateInfoCache();
>       }
>       jmxCacheTS = Time.now();
>       lastRecs = null; // in case regular interval update is not running
>     }
>   }
> {code}
> Notice that getAllMetrics is set to true only when both of the following hold:
> # jmxCacheTTL has passed
> # lastRecs == null
> lastRecs is set to null in the same function, but gets reassigned by getMetrics().
> However, getMetrics() can be called from a different thread:
> # MetricsSystemImpl.onTimerEvent()
> # MetricsSystemImpl.publishMetricsNow()
> Consider the following sequence:
> # updateJmxCache() is called by getMBeanInfo() from a thread getting cached info.
> ** lastRecs is set to null.
> # A metrics source is updated with a new value/field.
> # getMetrics() is called by publishMetricsNow() or onTimerEvent() from a different thread getting the latest metrics.
> ** lastRecs is updated (!= null).
> # jmxCacheTTL elapses.
> # updateJmxCache() is called again via getMBeanInfo().
> ** However, because lastRecs has already been updated (!= null), getAllMetrics will not be set to true. So updateInfoCache() is not called and getMBeanInfo() returns the old cached info.
> We ran into this issue on a cluster where a new metric did not get published until much later.
> The case can be made worse by a periodic call to getMetrics() (driven by an external program or script). In such a case, getMBeanInfo() may never be able to retrieve the new record.
> The desired behavior is that updateJmxCache() is guaranteed to call updateInfoCache() once after jmxCacheTTL, if lastRecs has been set to null by updateJmxCache() itself.
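
The sequence in the quoted description can be replayed deterministically in a small model. What follows is a minimal sketch, not Hadoop code: the class JmxCacheRaceSketch, its manual clock, the string lists standing in for metrics records, and the clearedByJmxUpdate flag are all hypothetical names introduced for illustration. The two threads are replayed in order on a single thread, so the interleaving is fixed. The flag shows one possible way to get the desired behavior (remember that updateJmxCache() itself cleared lastRecs); it is not necessarily what the attached patch does.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified single-threaded model of the sequence in the issue; not Hadoop code. */
class JmxCacheRaceSketch {
    static final long TTL = 10;                  // stands in for jmxCacheTTL
    long now = 100;                              // manual clock instead of Time.now()
    long cacheTs = 0;                            // stands in for jmxCacheTS; cache starts expired
    List<String> source = new ArrayList<>();     // live metric names in the source
    List<String> lastRecs = null;                // last snapshot taken by getMetrics()
    List<String> infoCache = new ArrayList<>();  // what getMBeanInfo() would return
    boolean clearedByJmxUpdate = true;           // hypothetical flag: lastRecs starts out null
    final boolean useFlag;

    JmxCacheRaceSketch(boolean useFlag) {
        this.useFlag = useFlag;
    }

    /** Models getMetrics(): snapshot the source into lastRecs. */
    void getMetrics() {
        lastRecs = new ArrayList<>(source);
    }

    /** Models updateJmxCache() with either the original check or the flag. */
    void updateJmxCache() {
        if (now - cacheTs < TTL) {
            return;                              // cache still fresh
        }
        boolean getAllMetrics;
        if (useFlag) {
            getAllMetrics = clearedByJmxUpdate;  // did we clear lastRecs ourselves?
            clearedByJmxUpdate = false;
        } else {
            getAllMetrics = (lastRecs == null);  // original check from the issue
        }
        if (getAllMetrics) {
            getMetrics();
            infoCache = new ArrayList<>(lastRecs); // models updateInfoCache()
        }
        cacheTs = now;
        lastRecs = null;
        clearedByJmxUpdate = true;
    }

    /** Replays the five steps from the description and returns the cached info. */
    static List<String> run(boolean useFlag) {
        JmxCacheRaceSketch a = new JmxCacheRaceSketch(useFlag);
        a.source.add("metricA");
        a.updateJmxCache();       // 1. JMX read; lastRecs ends up null
        a.source.add("metricB");  // 2. source gains a new metric
        a.getMetrics();           // 3. timer/publish thread sets lastRecs != null
        a.now += TTL;             // 4. jmxCacheTTL elapses
        a.updateJmxCache();       // 5. JMX read again
        return a.infoCache;
    }

    public static void main(String[] args) {
        System.out.println("original check: " + run(false)); // [metricA]
        System.out.println("with flag:      " + run(true));  // [metricA, metricB]
    }
}
```

With the original lastRecs == null check, the second JMX read skips the info refresh and metricB stays invisible. With the flag, the refresh no longer depends on whether another caller has reassigned lastRecs in the meantime.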