hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
Date Tue, 19 May 2015 21:10:02 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551234#comment-14551234
] 

Jing Zhao commented on HDFS-7991:
---------------------------------

Recently we saw several customer clusters where the NameNodes were stopped without checking
for or doing a checkpoint. This led to hours of downtime while loading large amounts of
editlog (some clusters also hit the issue reported in HDFS-7609, which made things worse).

I had an offline discussion with [~cnauroth] and [~jnp] about this functionality. Here is
a summary of the options we came up with:
# The solution developed in the current patch: the script sends a saveNamespace request to the
NameNode before stopping it, and the NameNode does an extra checkpoint if necessary, based
on the time of the latest checkpoint and the total number of transactions not yet covered by
a checkpoint. The drawback of this method is that if a checkpoint is necessary, the admin will
see the stop command block for 10 minutes or more. The admin can also get confused if the
saveNamespace command fails.
# Another way is that, instead of issuing the saveNamespace command directly, the script first
checks the time of the latest checkpoint and the total number of transactions (maybe through
the jmxget command). If a checkpoint is necessary, the script aborts and prints a warning
message asking the admin to run "dfsadmin -saveNamespace". This avoids the long wait of
solution #1. Also, if the jmxget command fails, the admin can use a command argument to force
stopping the NameNode if he/she can confirm that a checkpoint is not necessary.
# The third option is to move the checkpoint logic into the shutdown hook of the NameNode.
The biggest challenge here is the synchronization between the server and the script, i.e.,
deciding when and whether to kill the NN in the script. The script may have to poll the current
state of the NameNode and guess whether the NameNode is still doing a checkpoint or is hung
somewhere else. Currently I do not see an easy way to achieve this.
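
To make option #2 concrete, here is a minimal sketch of the decision the stop script would
make. It is not taken from any HDFS-7991 patch. The names CheckpointStatus and
may_stop_namenode are hypothetical, the default thresholds are illustrative values only, and
treating either threshold being exceeded as "checkpoint needed" is an assumption. How the
metrics are actually fetched (e.g. through jmxget) is left out; a failed fetch is modeled
as None.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CheckpointStatus:
    """Hypothetical snapshot of the two values the script would read."""
    seconds_since_checkpoint: int  # age of the latest checkpoint
    txns_since_checkpoint: int     # transactions not yet covered by a checkpoint


def may_stop_namenode(status: Optional[CheckpointStatus],
                      max_age_seconds: int = 3600,     # illustrative threshold
                      max_txns: int = 1_000_000,       # illustrative threshold
                      force: bool = False) -> Tuple[bool, str]:
    """Return (ok_to_stop, message) for the stop script.

    status is None when the metrics could not be read (e.g. the query failed).
    """
    if force:
        # The admin has confirmed that a checkpoint is not necessary.
        return True, "forced: skipping the checkpoint check"
    if status is None:
        # Without metrics the script cannot decide, so it refuses by default.
        return False, ("could not read checkpoint metrics; "
                       "use the force argument if a checkpoint is not needed")
    too_old = status.seconds_since_checkpoint > max_age_seconds
    too_many = status.txns_since_checkpoint > max_txns
    if too_old or too_many:
        # Abort instead of blocking on a long saveNamespace call.
        return False, 'checkpoint needed; run "dfsadmin -saveNamespace" first'
    return True, "latest checkpoint is recent; safe to stop"
```

The script would proceed with the stop only when the first element of the result is true,
and print the message otherwise.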

For now we think #2 may be the best solution. I will update the patch accordingly. [~aw],
could you please also share your thoughts here? Thanks.


> Allow users to skip checkpoint when stopping NameNode
> -----------------------------------------------------
>
>                 Key: HDFS-7991
>                 URL: https://issues.apache.org/jira/browse/HDFS-7991
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-7991.000.patch, HDFS-7991.001.patch, HDFS-7991.002.patch, HDFS-7991.003.patch, HDFS-7991.004.patch
>
>
> This is a follow-up jira of HDFS-6353. HDFS-6353 adds the functionality to check if saving
> namespace is necessary before stopping namenode. As [~kihwal] pointed out in this [comment|https://issues.apache.org/jira/browse/HDFS-6353?focusedCommentId=14380898&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14380898],
> in a secured cluster this new functionality requires the user to be kinit'ed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
