hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6353) Check and make checkpoint before stopping the NameNode
Date Wed, 25 Mar 2015 23:09:54 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380993#comment-14380993 ]

Jing Zhao commented on HDFS-6353:
---------------------------------

Had an offline discussion with Suresh and Jitendra. What we can do is:
# add a flag to allow users to disable the feature
# in the shell script, check the return value of the saveNamespace command and, if it fails,
print an error/hint message and exit

We can do this in a follow-up JIRA in trunk before merging the feature into branch-2.
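
A rough sketch of what the two items above could look like on the script side. This is not taken from any of the attached patches: the CHECKPOINT_BEFORE_STOP switch, the DFSADMIN override and the function name are hypothetical names used only for illustration. The only real commands assumed are `hdfs dfsadmin -safemode enter` and `hdfs dfsadmin -saveNamespace`.

```shell
#!/usr/bin/env bash
# Sketch only. CHECKPOINT_BEFORE_STOP is a hypothetical on/off switch; the
# actual flag name was not decided in this discussion.
: "${CHECKPOINT_BEFORE_STOP:=true}"
# Hypothetical override for the dfsadmin command, so the function can be
# exercised without a running cluster.
: "${DFSADMIN:=hdfs dfsadmin}"

checkpoint_before_stop() {
  # Item 1: let users disable the feature.
  if [ "${CHECKPOINT_BEFORE_STOP}" != "true" ]; then
    echo "Checkpoint before stop is disabled; skipping saveNamespace."
    return 0
  fi

  # saveNamespace is only accepted while the NameNode is in safe mode.
  if ! ${DFSADMIN} -safemode enter; then
    echo "ERROR: could not put the NameNode into safe mode." >&2
    return 1
  fi

  # Item 2: check the return value of saveNamespace and report a failure.
  if ! ${DFSADMIN} -saveNamespace; then
    echo "ERROR: saveNamespace failed; the NameNode will not be stopped." >&2
    echo "HINT: check the NameNode log, or disable this check and retry." >&2
    return 1
  fi

  return 0
}
```

The stop script would then call something like `checkpoint_before_stop || exit 1` before stopping the daemon, so that a failed checkpoint aborts the stop instead of being silently ignored.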

> Check and make checkpoint before stopping the NameNode
> ------------------------------------------------------
>
>                 Key: HDFS-6353
>                 URL: https://issues.apache.org/jira/browse/HDFS-6353
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>            Reporter: Suresh Srinivas
>            Assignee: Jing Zhao
>             Fix For: 3.0.0
>
>         Attachments: HDFS-6353.000.patch, HDFS-6353.001.patch, HDFS-6353.002.branch-2.patch,
HDFS-6353.002.branch-2.patch, HDFS-6353.002.patch
>
>
> One of the failure patterns I have seen is that, in some rare circumstances, an inconsistency
causes the secondary or standby to fail to consume the editlog. The only solution when this happens
is to save the namespace at the current active namenode. But sometimes when this happens, an unsuspecting
admin might end up restarting the namenode, which then requires a more complicated solution to the problem
(such as ignoring the editlog records that cannot be consumed).
> How about adding the following functionality:
> When the checkpointer (standby or secondary) fails to consume the editlog, it lets the active
namenode know about this failure, subject to a configurable flag (on/off). The active namenode can
then enter safemode and save the namespace. While in this type of safemode, the namenode UI also shows
information about the checkpoint failure and that it is saving the namespace. Once the namespace is saved,
the namenode can come out of safemode.
> This means service unavailability (even in an HA cluster). But it might be worth it to avoid
long startup times or the need for other manual fixes. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
