hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mingliang Liu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-9129) Move the safemode block count into BlockManager
Date Sat, 24 Oct 2015 06:53:27 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Mingliang Liu updated HDFS-9129:
    Attachment: HDFS-9129.010.patch

Thank you [~wheat9] for your review. The v10 patch addresses the comments, along with rebasing
from {{trunk}} branch. As there are conflicts because of [HDFS-4015], I also invited [~arpitagarwal]
and [~anu] to review the patch.

See response inline to [~wheat9]'s comments.
1. It does not give much information compared to figuring out the issues on the code directly.
What does "thresholds met" / "extensions reached" mean? It causes more confusions than explanations.
The motivation is that diagram is helpful for glimpse, if we can provide definition of "thresholds
met" / "extension reached". In v10 patch, I add more explanations in the comments.

LOG.error("Non-recognized block manager safe mode status: {}", status);
2. Should be an assert.
Truely, I'll simply {{assert false : "some comment"}}.

private volatile boolean isInManualSafeMode = false;
private volatile boolean isInResourceLowSafeMode = false;

isInManualSafeMode = !resourcesLow;
isInResourceLowSafeMode = resourcesLow;
3. How do these two variables synchronize? Is the system in consistent state in the middle
of the execution?
Good question. Actually it's not in consistent state in the middle of the execution. If the
{{resourceLow}} is true, and before that name node is in manual safe mode, in the middle of
the execution {{isInSafeMode}} will return false, which means the safe mode is OFF. The main
reason is that writing to the two variables (aka enter/leave safemode) is guarded by the FSNamesystem
write lock, while read is not.

The enum type state was replace with two boolean flags in the v7 patch, because the two-lay
state machine was cumbersome per offline discussion with [~wheat9] and [~jingzhao]. Guarding
all the read looks expensive. Bitwise operation on a flag variable seems tricky.

The new design goes back to the {{trunk}} logic, which makes the block manager stays in safe
mode in the middle of the execution:
      if (resourcesLow) {
        isInResourceLowSafeMode = true;
      } else {
        isInManualSafeMode = true;
In case both {{isInManualSafeMode}} and {{isInResourceLowSafeMode}} are true, {{isInResourceLowSafeMode}}
will short-circuit {{isInManualSafeMode}}, according to our current logic, e.g. {{getTurnOffTip()}}.

+    bmSafeMode.setBlockTotal(BLOCK_TOTAL);
+    assertEquals(BMSafeModeStatus.THRESHOLD, getSafeModeStatus());
+    assertTrue(bmSafeMode.isInSafeMode());
4. It makes sense to put it in a test instead of in the @Before method.
That makes sense to me. I'll add a new test called {{testSetBlockTotal}}.

+    // EXTENSION -> OFF
+    Whitebox.setInternalState(bmSafeMode, "status", BMSafeModeStatus.EXTENSION);
+    reachBlockThreshold();
+    reachExtension();
+    bmSafeMode.checkSafeMode();
+    assertEquals(BMSafeModeStatus.OFF, getSafeModeStatus());
5. Please refactor the code – you can reuse the getSafeModeStatus() that is defined by the
Actually the first statment for each state transition tests is to *set* the internal state.
I'll make the fields package accessible so the Whitebox is not needed. See the end of this

assertEquals(getBlockSafe(), i);
6. The expected value should go first according to the function signature.
Nice catch. I will revise the whole test to fix all similar ones.

7. A higher level question is that why it needs so many getInternalState() statements? It
looks to me that many of these behaviors can be observed outside without whitebox testing.
The reason was that *all* non-static fields in {{BlockManagerSafeMode}} is private. By design,
I made them _private_ because we don't need access to its internal state in {{BlockManager}}.

In the new design, the v10 patch simply makes the fields package private so the test will
be more straight-forward.

> Move the safemode block count into BlockManager
> -----------------------------------------------
>                 Key: HDFS-9129
>                 URL: https://issues.apache.org/jira/browse/HDFS-9129
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Haohui Mai
>            Assignee: Mingliang Liu
>         Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, HDFS-9129.002.patch, HDFS-9129.003.patch,
HDFS-9129.004.patch, HDFS-9129.005.patch, HDFS-9129.006.patch, HDFS-9129.007.patch, HDFS-9129.008.patch,
HDFS-9129.009.patch, HDFS-9129.010.patch
> The {{SafeMode}} needs to track whether there are enough blocks so that the NN can get
out of the safemode. These fields can moved to the {{BlockManager}} class.

This message was sent by Atlassian JIRA

View raw message