hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mingliang Liu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-9129) Move the safemode block count into BlockManager
Date Sat, 24 Oct 2015 06:53:27 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Mingliang Liu updated HDFS-9129:
--------------------------------
    Attachment: HDFS-9129.010.patch

Thank you [~wheat9] for your review. The v10 patch addresses the comments, along with rebasing
from {{trunk}} branch. As there are conflicts because of [HDFS-4015], I also invited [~arpitagarwal]
and [~anu] to review the patch.

See response inline to [~wheat9]'s comments.
{quote}
1. It does not give much information compared to figuring out the issues on the code directly.
What does "thresholds met" / "extensions reached" mean? It causes more confusions than explanations.
{quote}
The motivation is that diagram is helpful for glimpse, if we can provide definition of "thresholds
met" / "extension reached". In v10 patch, I add more explanations in the comments.

{quote}
{code}
LOG.error("Non-recognized block manager safe mode status: {}", status);
{code}
2. Should be an assert.
{quote}
Truely, I'll simply {{assert false : "some comment"}}.

{quote}
{code}
private volatile boolean isInManualSafeMode = false;
private volatile boolean isInResourceLowSafeMode = false;

...
isInManualSafeMode = !resourcesLow;
isInResourceLowSafeMode = resourcesLow;
{code}
3. How do these two variables synchronize? Is the system in consistent state in the middle
of the execution?
{quote}
Good question. Actually it's not in consistent state in the middle of the execution. If the
{{resourceLow}} is true, and before that name node is in manual safe mode, in the middle of
the execution {{isInSafeMode}} will return false, which means the safe mode is OFF. The main
reason is that writing to the two variables (aka enter/leave safemode) is guarded by the FSNamesystem
write lock, while read is not.

The enum type state was replace with two boolean flags in the v7 patch, because the two-lay
state machine was cumbersome per offline discussion with [~wheat9] and [~jingzhao]. Guarding
all the read looks expensive. Bitwise operation on a flag variable seems tricky.

The new design goes back to the {{trunk}} logic, which makes the block manager stays in safe
mode in the middle of the execution:
{code}
      if (resourcesLow) {
        isInResourceLowSafeMode = true;
      } else {
        isInManualSafeMode = true;
      }
{code}
In case both {{isInManualSafeMode}} and {{isInResourceLowSafeMode}} are true, {{isInResourceLowSafeMode}}
will short-circuit {{isInManualSafeMode}}, according to our current logic, e.g. {{getTurnOffTip()}}.

{quote}
{code}
+    // INITIALIZED -> THRESHOLD
+    bmSafeMode.setBlockTotal(BLOCK_TOTAL);
+    assertEquals(BMSafeModeStatus.THRESHOLD, getSafeModeStatus());
+    assertTrue(bmSafeMode.isInSafeMode());
{code}
4. It makes sense to put it in a test instead of in the @Before method.
{quote}
That makes sense to me. I'll add a new test called {{testSetBlockTotal}}.

{quote}
{code}
+    // EXTENSION -> OFF
+    Whitebox.setInternalState(bmSafeMode, "status", BMSafeModeStatus.EXTENSION);
+    reachBlockThreshold();
+    reachExtension();
+    bmSafeMode.checkSafeMode();
+    assertEquals(BMSafeModeStatus.OFF, getSafeModeStatus());
{code}
5. Please refactor the code – you can reuse the getSafeModeStatus() that is defined by the
class.
{quote}
Actually the first statment for each state transition tests is to *set* the internal state.
I'll make the fields package accessible so the Whitebox is not needed. See the end of this
comment.

{quote}
{code}
assertEquals(getBlockSafe(), i);
{code}
6. The expected value should go first according to the function signature.
{quote}
Nice catch. I will revise the whole test to fix all similar ones.

{quote}
7. A higher level question is that why it needs so many getInternalState() statements? It
looks to me that many of these behaviors can be observed outside without whitebox testing.
{quote}
The reason was that *all* non-static fields in {{BlockManagerSafeMode}} is private. By design,
I made them _private_ because we don't need access to its internal state in {{BlockManager}}.

In the new design, the v10 patch simply makes the fields package private so the test will
be more straight-forward.

> Move the safemode block count into BlockManager
> -----------------------------------------------
>
>                 Key: HDFS-9129
>                 URL: https://issues.apache.org/jira/browse/HDFS-9129
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Haohui Mai
>            Assignee: Mingliang Liu
>         Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, HDFS-9129.002.patch, HDFS-9129.003.patch,
HDFS-9129.004.patch, HDFS-9129.005.patch, HDFS-9129.006.patch, HDFS-9129.007.patch, HDFS-9129.008.patch,
HDFS-9129.009.patch, HDFS-9129.010.patch
>
>
> The {{SafeMode}} needs to track whether there are enough blocks so that the NN can get
out of the safemode. These fields can moved to the {{BlockManager}} class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message