hadoop-common-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9932) Name node crashes due to improper synchronization in RetryCache
Date Wed, 04 Sep 2013 19:10:54 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758206#comment-13758206 ]

Jing Zhao commented on HADOOP-9932:

Also, in RetryCache#addCacheEntry and RetryCache#addCacheEntryWithPayload, instead of calling
"newEntry.completed(true)", which acquires the entry monitor and makes a notifyAll call, maybe
we can set the state of the entry to "SUCCESS" directly, through a new setComplete method or a
new constructor.

   public void addCacheEntry(byte[] clientId, int callId) {
     CacheEntry newEntry = new CacheEntry(clientId, callId, System.nanoTime()
-        + expirationTime);
-    newEntry.completed(true);
-    set.put(newEntry);
+        + expirationTime, true);
+    synchronized (this) {
+      set.put(newEntry);  
+    }
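
To make the suggestion concrete, here is a minimal, self-contained sketch of the idea. It is not the actual Hadoop code: RetryCacheSketch, the String key (standing in for the byte[] clientId plus callId), the HashMap (standing in for LightWeightCache), and the get() helper are all assumptions made for illustration. The point it shows is that an entry which no other thread can see yet can have its state set in the constructor, so no entry monitor or notifyAll is needed, while the shared set is only touched under the cache lock.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for RetryCache; not the Hadoop implementation. */
class RetryCacheSketch {
  static final byte INPROGRESS = 0;
  static final byte SUCCESS = 1;

  static final class CacheEntry {
    final String key;            // assumption: replaces byte[] clientId + int callId
    final long expirationTime;
    private byte state = INPROGRESS;

    CacheEntry(String key, long expirationTime) {
      this.key = key;
      this.expirationTime = expirationTime;
    }

    // Hypothetical constructor from the suggestion: the entry has not been
    // published to any other thread yet, so nobody can be waiting on it and
    // the state can be set without the monitor or a notifyAll call.
    CacheEntry(String key, long expirationTime, boolean success) {
      this(key, expirationTime);
      if (success) {
        this.state = SUCCESS;
      }
    }

    synchronized byte getState() {
      return state;
    }
  }

  // Assumption: a plain HashMap stands in for the non-thread-safe LightWeightCache.
  private final Map<String, CacheEntry> set = new HashMap<>();
  private final long expirationNanos;

  RetryCacheSketch(long expirationNanos) {
    this.expirationNanos = expirationNanos;
  }

  void addCacheEntry(String clientId, int callId) {
    CacheEntry newEntry = new CacheEntry(clientId + ":" + callId,
        System.nanoTime() + expirationNanos, true);
    // Every access to the shared set goes through the same lock.
    synchronized (this) {
      set.put(newEntry.key, newEntry);
    }
  }

  // Hypothetical lookup helper, guarded by the same lock as addCacheEntry.
  synchronized CacheEntry get(String clientId, int callId) {
    return set.get(clientId + ":" + callId);
  }
}
```

Because put and get in this sketch both run under the same lock, a reader that finds the entry also sees the state written in the constructor.
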
> Name node crashes due to improper synchronization in RetryCache
> ---------------------------------------------------------------
>                 Key: HADOOP-9932
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9932
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Blocker
>         Attachments: HADOOP-9932.patch, HADOOP-9932.patch
> In LightWeightCache#evictExpiredEntries(), the precondition check can fail. [~patwhitey2007]
> ran an HA failover test, and the failure occurred while the SBN was catching up with edits
> during a transition to active. This caused the NN to terminate.
> Here is my theory: if an RPC handler calls waitForCompletion() and happens to remove
> the head of the queue in get(), it will race with evictExpiredEntries() from put().

