hadoop-hdfs-issues mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
Date Fri, 22 May 2015 17:24:18 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556470#comment-14556470 ]

Suresh Srinivas edited comment on HDFS-7991 at 5/22/15 5:23 PM:
----------------------------------------------------------------

bq. I've been working under the assumption that the sites that are hitting this issue are
running a secondary namenode. Is that not true? Doesn't the 2NN make this whole issue go away?

Ideally yes, when the 2NN or standby is working. But we have had many issues where checkpointing is
not done by the 2NN or standby, for the following reasons:
1. The editlog had an issue and could not be consumed by the 2NN or standby
2. Checkpointing is lagging behind (see HDFS-7609)
3. There could be many other bugs and issues (standby down, etc.) that could result in a delayed
checkpoint

Repeating myself, this is very important functionality for avoiding data loss and service unavailability.
We need a way to save the namespace. Today, operators who understand this situation
save the namespace manually before stopping the namenode. People who miss doing that run into
production issues. This jira proposes automatically saving the namespace to avoid those issues. I don't
understand why it is "hacking the hell out of stuff".

[~vinayrpet], some comments:
bq. What if machine itself goes down suddenly after running for months/years, having tons
of millions of edits without checkpoint ?
Yes, there are times when saving the namespace may not be possible. But in the large majority of cases,
when HDFS issues are seen, inexperienced administrators just restart the cluster and run into
this issue.

bq. Anyway doing checkpoint in Active NameNode is not a big deal
If checkpointing in the active namenode were possible without pausing ongoing requests,
we would not have moved checkpointing to either the secondary or the standby. That is also the reason
why the namenode is first put into safemode, so that write requests are quiesced, and then save
namespace is called.
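
For illustration, the manual sequence described above (enter safemode, then save namespace) could be scripted roughly as follows. This is only a sketch, not the patch attached to this jira: the function name and the `run` parameter are hypothetical, and it assumes the `hdfs` command is on the PATH and that the caller has the required privileges (on a secured cluster, already kinit'ed).

```python
import subprocess
from typing import Callable, List, Sequence

# Commands an operator would run, in order, before stopping the NameNode.
SAVE_NAMESPACE_STEPS: List[List[str]] = [
    ["hdfs", "dfsadmin", "-safemode", "enter"],  # quiesce write requests
    ["hdfs", "dfsadmin", "-saveNamespace"],      # write a fresh fsimage
]


def save_namespace_before_stop(
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> bool:
    """Run each step in order and stop at the first failure.

    `run` is a hypothetical injection point (not part of Hadoop) that
    takes a command and returns its exit code; it defaults to
    subprocess.call so the real `hdfs` CLI is invoked.
    """
    for cmd in SAVE_NAMESPACE_STEPS:
        if run(cmd) != 0:
            # Do not attempt saveNamespace if safemode could not be entered.
            return False
    return True
```

A stop script could call `save_namespace_before_stop()` and only warn, rather than abort, when it returns False, since there are cases where saving the namespace is not possible.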


> Allow users to skip checkpoint when stopping NameNode
> -----------------------------------------------------
>
>                 Key: HDFS-7991
>                 URL: https://issues.apache.org/jira/browse/HDFS-7991
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>         Attachments: HDFS-7991-shellpart.patch, HDFS-7991.000.patch, HDFS-7991.001.patch,
HDFS-7991.002.patch, HDFS-7991.003.patch, HDFS-7991.004.patch
>
>
> This is a follow-up jira of HDFS-6353. HDFS-6353 adds the functionality to check if saving
namespace is necessary before stopping namenode. As [~kihwal] pointed out in this [comment|https://issues.apache.org/jira/browse/HDFS-6353?focusedCommentId=14380898&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14380898],
in a secured cluster this new functionality requires the user to be kinit'ed.



