hadoop-hdfs-issues mailing list archives

From "Matt Foley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2010) Clean up and test behavior under failed edit streams
Date Wed, 29 Jun 2011 02:04:29 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056951#comment-13056951 ]

Matt Foley commented on HDFS-2010:
----------------------------------

Hi Aaron and Todd, in FSEditLog:
{code}
+      if (badJournals.size() >= stillGoodJournals.size()) {
+        LOG.error("Could not sync any journal to persistent storage. " +
+            "Unsynced transactions: " + (txid - synctxid));
+        runtime.exit(1);
+      }
{code}

The test "if (badJournals.size() >= stillGoodJournals.size())" should probably be
"if (badJournals.size() >= journals.size())", for the following reason. Suppose you start
with 5 journals, and 3 of them fail in the block
{code}
        for (JournalAndStream jas : journals) {
          if (!jas.isActive()) continue;
          try {
            jas.getCurrentStream().setReadyToFlush();
            stillGoodJournals.add(jas);
          } catch (IOException ie) {
            LOG.error("Unable to get ready to flush.", ie);
            badJournals.add(jas);
          }
        }
{code}
Then suppose both remaining candidate journals actually sync successfully in the "// do the
sync" block.  At that point badJournals.size() is 3 and stillGoodJournals.size() is 2, so you'll
still conclude that (badJournals.size() >= stillGoodJournals.size()), and wrongly call exit().
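
To make the arithmetic concrete, here is a minimal standalone sketch of the two comparisons for the 5-journal scenario. It is not the FSEditLog code: the class and method names are hypothetical, and the journals are reduced to plain counts.

```java
// Standalone sketch; all names are hypothetical and this is not the FSEditLog code.
final class JournalExitCheckSketch {

    /** The check as written in the patch: failures vs. journals that got ready to flush. */
    public static boolean patchCheckExits(int badJournals, int stillGoodJournals) {
        return badJournals >= stillGoodJournals;
    }

    /** The check suggested in this comment: exit only when every journal has failed. */
    public static boolean suggestedCheckExits(int badJournals, int totalJournals) {
        return badJournals >= totalJournals;
    }

    public static void main(String[] args) {
        int totalJournals = 5;
        int failedToGetReady = 3;                           // setReadyToFlush() threw for these
        int stillGood = totalJournals - failedToGetReady;   // 2 candidates remain
        int failedDuringSync = 0;                           // both candidates sync fine
        int bad = failedToGetReady + failedDuringSync;      // 3

        // 3 >= 2, so the patch's check exits although two journals synced.
        System.out.println("patch check exits: " + patchCheckExits(bad, stillGood));
        // 3 >= 5 is false, so the suggested check keeps running.
        System.out.println("suggested check exits: " + suggestedCheckExits(bad, totalJournals));
    }
}
```

With these counts the first comparison fires even though two journals hold the transactions, while the second fires only once the bad count reaches the total number of journals.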

Also, I find the name "stillGoodJournals" confusing, because when a journal is found to be
bad in the "// do the sync" block, it isn't removed from the "stillGoodJournals" list.  Perhaps
"candidateJournalsToSync" would be more descriptive?

> Clean up and test behavior under failed edit streams
> ----------------------------------------------------
>
>                 Key: HDFS-2010
>                 URL: https://issues.apache.org/jira/browse/HDFS-2010
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: name-node
>    Affects Versions: Edit log branch (HDFS-1073)
>            Reporter: Todd Lipcon
>            Assignee: Aaron T. Myers
>             Fix For: Edit log branch (HDFS-1073)
>
>         Attachments: hdfs-2010.0.patch, hdfs-2010.1.patch
>
>
> Right now there is very little test coverage of situations where one or more of the edits
directories fails. In trunk, the behavior when all of the edits directories are dead is that
the NN prints a fatal level log message and calls Runtime.exit(-1).
> I don't think this is really the behavior we want. Needs a bit of thought, but I think
something like the following would make more sense:
> - any calls currently waiting on logSync should end up throwing an exception
> - NN should probably enter safe mode
> - ops can restore edits directories and then ask the NN to restore storage, at which
point it could exit safemode
> - alternatively, ops could ask the NN to do saveNamespace and then shut it down

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
