hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1222) Make improvements in ZKRMStateStore for fencing
Date Tue, 12 Nov 2013 07:49:17 GMT

    [ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819895#comment-13819895
] 

Bikas Saha commented on YARN-1222:
----------------------------------

Looks good.

Some minor comments.

In default fencing, we rely on the default zkacl to allow shared admin access for both RM's
right? Can we document that please?

Should these and others in handleStoreEvent() be inside the try block. If we caught an exception
then we should not be informing the apps that their store operation is complete. We are probably
getting away with it because everything happens on the async dispatcher thread. Hence the
RM transitions to standby before these notification events are processed.
{code}
         }
       } catch (Exception e) {
         LOG.error("Error storing app: " + appId, e);
-        storedException = e;
-      } finally {
-        if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
-          notifyDoneStoringApplication(appId, storedException);
-        } else {
-          notifyDoneUpdatingApplication(appId, storedException);
-        }
+        notifyStoreOperationFailed(e);
+      }
+      if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
+        notifyDoneStoringApplication(appId, storedException);
+      } else {
+        notifyDoneUpdatingApplication(appId, storedException);
       }
{code}

Log in main and catch block and before ExitUtil would help in debugging.
{code}+          if (haService.haEnabled) {
+            try {
+              // Transition to standby and reinit active services
+              haService.transitionToStandby(true);
+              return;
+            } catch (Exception e) {
+              // Do nothing. RM is going to shutdown
+            }
{code}

Why is there a new synchronized block here? What needs to be guarded against concurrent modification?
{code}
-          if (shouldRetry(ke.code()) && ++retry < numRetries) {
-            continue;
+          synchronized (ZKRMStateStore.this) {
+            if (shouldRetry(ke.code()) && ++retry < numRetries) {
+              continue;
+            }
{code}

The test should probably create separate copies of conf for the 2 RM's

In the test, lets put a comment saying, triggering a state store operation that makes rm1
realize that its not the master because it got fenced by the store.



> Make improvements in ZKRMStateStore for fencing
> -----------------------------------------------
>
>                 Key: YARN-1222
>                 URL: https://issues.apache.org/jira/browse/YARN-1222
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Karthik Kambatla
>         Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, yarn-1222-4.patch,
yarn-1222-5.patch, yarn-1222-6.patch, yarn-1222-7.patch, yarn-1222-8.patch, yarn-1222-8.patch
>
>
> Using multi-operations for every ZK interaction. 
> In every operation, automatically creating/deleting a lock znode that is the child of
the root znode. This is to achieve fencing by modifying the create/delete permissions on the
root znode.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Mime
View raw message