hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tony Wu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-12482) Race condition in JMX cache update
Date Thu, 15 Oct 2015 17:17:05 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tony Wu updated HADOOP-12482:
-----------------------------
    Description: 
updateJmxCache() was updated in HADOOP-11301. However the patch introduced a race condition.
In updateJmxCache() function in MetricsSourceAdapter.java:

{code:java}
  private void updateJmxCache() {
    boolean getAllMetrics = false;
    synchronized (this) {
      if (Time.now() - jmxCacheTS >= jmxCacheTTL) {
        // temporarilly advance the expiry while updating the cache
        jmxCacheTS = Time.now() + jmxCacheTTL;
        if (lastRecs == null) {
          getAllMetrics = true;
        }
      } else {
        return;
      }

      if (getAllMetrics) {
        MetricsCollectorImpl builder = new MetricsCollectorImpl();
        getMetrics(builder, true);
      }

      updateAttrCache();
      if (getAllMetrics) {
        updateInfoCache();
      }
      jmxCacheTS = Time.now();
      lastRecs = null; // in case regular interval update is not running
    }
  }
{code}

Notice that getAllMetrics is set to true when:
# jmxCacheTTL has passed
# lastRecs == null

lastRecs is set to null in the same function, but gets reassigned by getMetrics().

However getMetrics() can be called from a different thread:
# MetricsSystemImpl.onTimerEvent()
# MetricsSystemImpl.publishMetricsNow()

Consider the following sequence:
# updateJmxCache() is called by getMBeanInfo() from a thread getting cached info. 
** lastRecs is set to null.
# metrics sources is updated with new value/field.
# getMetrics() is called by publishMetricsNow() or onTimerEvent() from a different thread
getting the latest metrics. 
** lastRecs is updated (!= null).
# jmxCacheTTL passed.
# updateJmxCache() is called again via getMBeanInfo().
** However because lastRecs is already updated (!= null), getAllMetrics will not be set to
true. So updateInfoCache() is not called and getMBeanInfo() returns the old cached info.

We ran into this issue on a cluster where a new metric did not get published until much later.

The case can be made worse by a periodic call to getMetrics() (driven by an external program
or script). In such case getMBeanInfo() may never be able to retrieve the new record.

The desired behavior should be that updateJmxCache() will guarantee to call updateInfoCache()
once after jmxCacheTTL, if lastRecs has been set to null by updateJmxCache() itself.

  was:
updateJmxCache() was in HADOOP-11301. However the patch introduced a race condition. In updateJmxCache()
function in MetricsSourceAdapter.java:

{code:java}
  private void updateJmxCache() {
    boolean getAllMetrics = false;
    synchronized (this) {
      if (Time.now() - jmxCacheTS >= jmxCacheTTL) {
        // temporarilly advance the expiry while updating the cache
        jmxCacheTS = Time.now() + jmxCacheTTL;
        if (lastRecs == null) {
          getAllMetrics = true;
        }
      } else {
        return;
      }

      if (getAllMetrics) {
        MetricsCollectorImpl builder = new MetricsCollectorImpl();
        getMetrics(builder, true);
      }

      updateAttrCache();
      if (getAllMetrics) {
        updateInfoCache();
      }
      jmxCacheTS = Time.now();
      lastRecs = null; // in case regular interval update is not running
    }
  }
{code}

Notice that getAllMetrics is set to true when:
# jmxCacheTTL has passed
# lastRecs == null

lastRecs is set to null in the same function, but gets reassigned by getMetrics().

However getMetrics() can be called from a different thread:
# MetricsSystemImpl.onTimerEvent()
# MetricsSystemImpl.publishMetricsNow()

Consider the following sequence:
# updateJmxCache() is called by getMBeanInfo() from a thread getting cached info. 
** lastRecs is set to null.
# metrics sources is updated with new value/field.
# getMetrics() is called by publishMetricsNow() or onTimerEvent() from a different thread
getting the latest metrics. 
** lastRecs is updated (!= null).
# jmxCacheTTL passed.
# updateJmxCache() is called again via getMBeanInfo().
** However because lastRecs is already updated (!= null), getAllMetrics will not be set to
true. So updateInfoCache() is not called and getMBeanInfo() returns the old cached info.

We ran into this issue on a cluster where a new metric did not get published until much later.

The case can be made worse by a periodic call to getMetrics() (driven by an external program
or script). In such case getMBeanInfo() may never be able to retrieve the new record.

The desired behavior should be that updateJmxCache() will guarantee to call updateInfoCache()
once after jmxCacheTTL, if lastRecs has been set to null by updateJmxCache() itself.


> Race condition in JMX cache update
> ----------------------------------
>
>                 Key: HADOOP-12482
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12482
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Tony Wu
>            Assignee: Tony Wu
>         Attachments: HADOOP-12482.001.patch
>
>
> updateJmxCache() was updated in HADOOP-11301. However the patch introduced a race condition.
In updateJmxCache() function in MetricsSourceAdapter.java:
> {code:java}
>   private void updateJmxCache() {
>     boolean getAllMetrics = false;
>     synchronized (this) {
>       if (Time.now() - jmxCacheTS >= jmxCacheTTL) {
>         // temporarilly advance the expiry while updating the cache
>         jmxCacheTS = Time.now() + jmxCacheTTL;
>         if (lastRecs == null) {
>           getAllMetrics = true;
>         }
>       } else {
>         return;
>       }
>       if (getAllMetrics) {
>         MetricsCollectorImpl builder = new MetricsCollectorImpl();
>         getMetrics(builder, true);
>       }
>       updateAttrCache();
>       if (getAllMetrics) {
>         updateInfoCache();
>       }
>       jmxCacheTS = Time.now();
>       lastRecs = null; // in case regular interval update is not running
>     }
>   }
> {code}
> Notice that getAllMetrics is set to true when:
> # jmxCacheTTL has passed
> # lastRecs == null
> lastRecs is set to null in the same function, but gets reassigned by getMetrics().
> However getMetrics() can be called from a different thread:
> # MetricsSystemImpl.onTimerEvent()
> # MetricsSystemImpl.publishMetricsNow()
> Consider the following sequence:
> # updateJmxCache() is called by getMBeanInfo() from a thread getting cached info. 
> ** lastRecs is set to null.
> # metrics sources is updated with new value/field.
> # getMetrics() is called by publishMetricsNow() or onTimerEvent() from a different thread
getting the latest metrics. 
> ** lastRecs is updated (!= null).
> # jmxCacheTTL passed.
> # updateJmxCache() is called again via getMBeanInfo().
> ** However because lastRecs is already updated (!= null), getAllMetrics will not be set
to true. So updateInfoCache() is not called and getMBeanInfo() returns the old cached info.
> We ran into this issue on a cluster where a new metric did not get published until much
later.
> The case can be made worse by a periodic call to getMetrics() (driven by an external
program or script). In such case getMBeanInfo() may never be able to retrieve the new record.
> The desired behavior should be that updateJmxCache() will guarantee to call updateInfoCache()
once after jmxCacheTTL, if lastRecs has been set to null by updateJmxCache() itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message