hadoop-hdfs-issues mailing list archives

From "He Xiaoqiao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-9068) SBN checkpoint could not work after the only name directory recovery from failure
Date Mon, 14 Sep 2015 03:59:45 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Xiaoqiao updated HDFS-9068:
------------------------------
    Attachment: HDFS-9068.patch

Attached patch: check whether a failed directory is usable again before saving the fsimage.

> SBN checkpoint could not work after the only name directory recovery from failure
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-9068
>                 URL: https://issues.apache.org/jira/browse/HDFS-9068
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.1
>            Reporter: He Xiaoqiao
>         Attachments: HDFS-9068.patch
>
>
> The SBN periodically checkpoints to {{dfs.namenode.name.dir}}, but the checkpointer stops working when {{dfs.namenode.name.dir}} is configured with only one directory and the disk holding that directory recovers from a failure.
> The affected class is org.apache.hadoop.hdfs.server.namenode.FSImage.java
> {code:title=org.apache.hadoop.hdfs.server.namenode.FSImage.java|borderStyle=solid}
> @Override
> public void run() {
>   try {
>     saveFSImage(context, sd, nnf);
>   } catch (SaveNamespaceCancelledException snce) {
>     LOG.info("Cancelled image saving for " + sd.getRoot() +
>         ": " + snce.getMessage());
>     // don't report an error on the storage dir!
>   } catch (Throwable t) {
>     LOG.error("Unable to save image for " + sd.getRoot(), t);
>     context.reportErrorOnStorageDirectory(sd);
>   }
> }
> {code}
> When {{saveFSImage(context, sd, nnf)}} fails because the storage is unavailable or has failed, sd is added to errorSDs by {{context.reportErrorOnStorageDirectory(sd)}} and is never used again, even after it recovers from the failure. Because the checkpointer then keeps failing, the JournalNode accumulates a large number of editlog files and the NameNode takes a long time to restart.
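The failure mode described above can be illustrated with a small standalone sketch. It is not Hadoop code and not the attached patch: StorageDirTracker, saveImage, restoreRecoveredDirs and the path /data/nn are hypothetical names used only to model a directory that is moved to an error list on failure and is retried only if the saver rechecks failed directories first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical model of the storage bookkeeping in the report; not Hadoop code. */
class StorageDirTracker {
    private final List<String> activeDirs = new ArrayList<>();
    private final List<String> failedDirs = new ArrayList<>();
    private final Predicate<String> isUsable; // stands in for a real disk health check

    StorageDirTracker(List<String> dirs, Predicate<String> isUsable) {
        this.activeDirs.addAll(dirs);
        this.isUsable = isUsable;
    }

    /** Models the effect of reporting an error: the directory leaves the active list. */
    void reportError(String dir) {
        if (activeDirs.remove(dir)) {
            failedDirs.add(dir);
        }
    }

    /** Moves failed directories that pass the health check back to the active list. */
    void restoreRecoveredDirs() {
        List<String> recovered = new ArrayList<>();
        for (String dir : failedDirs) {
            if (isUsable.test(dir)) {
                recovered.add(dir);
            }
        }
        failedDirs.removeAll(recovered);
        activeDirs.addAll(recovered);
    }

    /** Returns the directories an image was "saved" to; unusable ones are reported as failed. */
    List<String> saveImage(boolean recheckFailedDirs) {
        if (recheckFailedDirs) {
            restoreRecoveredDirs();
        }
        List<String> saved = new ArrayList<>();
        for (String dir : new ArrayList<>(activeDirs)) {
            if (isUsable.test(dir)) {
                saved.add(dir);
            } else {
                reportError(dir);
            }
        }
        return saved;
    }

    public static void main(String[] args) {
        Set<String> healthyDisks = new HashSet<>();            // the only disk starts out failed
        StorageDirTracker tracker =
            new StorageDirTracker(Arrays.asList("/data/nn"), healthyDisks::contains);

        System.out.println(tracker.saveImage(false));          // save fails, directory is excluded
        healthyDisks.add("/data/nn");                          // disk recovers
        System.out.println(tracker.saveImage(false));          // still nothing to save to
        System.out.println(tracker.saveImage(true));           // recheck restores the directory
    }
}
```

With a single configured directory, the second call shows the reported problem: the disk is healthy again but no image is written because nothing re-examines the error list. Rechecking before the save, as in the third call, is one way to let the checkpoint resume under these assumptions.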



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
