hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4233) NN keeps serving even after no journals started while rolling edit
Date Thu, 29 Nov 2012 17:32:58 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506606#comment-13506606 ]

Kihwal Lee commented on HDFS-4233:
----------------------------------

Here are more details: rollEditLog() was called via RPC from the SNN, and opening the new edit
files failed. The exception was sent back to the caller, but no action was taken locally. From
this point on, the edit log state is BETWEEN_LOG_SEGMENTS and no further rolling is allowed
because endCurrentLogSegment() fails. But logSync() and logEdit() went on as if nothing were wrong.

Trunk does not have this issue. In {{mapJournalsAndReportErrors()}}, if a journal marked as
required fails, the namenode will terminate. If none is marked required, it will simply throw
an exception, even if all journals fail. However, logSync() will then log FATAL and terminate,
since JournalSet#isEmpty() works differently in trunk.
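
The trunk behavior described above can be sketched roughly as follows. This is a simplified
model and not the actual Hadoop code: the class names ModelJournal and ModelJournalSet, the
Outcome enum, and the method applyFailures() are all hypothetical, and isEmpty() is assumed
to mean "no journal is still active".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical journal record; not the real Hadoop type.
class ModelJournal {
    final boolean required;
    boolean active = true;

    ModelJournal(boolean required) {
        this.required = required;
    }
}

// Hypothetical model of the failure handling described for trunk.
class ModelJournalSet {
    enum Outcome { OK, EXCEPTION, TERMINATE }

    private final List<ModelJournal> journals = new ArrayList<>();

    void add(ModelJournal j) {
        journals.add(j);
    }

    // Marks the given journals as failed and reports the resulting action.
    // A failed required journal terminates; otherwise losing every journal
    // only surfaces as an exception to the caller.
    Outcome applyFailures(List<ModelJournal> failed) {
        boolean requiredFailed = false;
        for (ModelJournal j : failed) {
            j.active = false;
            if (j.required) {
                requiredFailed = true;
            }
        }
        if (requiredFailed) {
            return Outcome.TERMINATE;
        }
        return isEmpty() ? Outcome.EXCEPTION : Outcome.OK;
    }

    // Assumed semantics: the set is "empty" when no journal is still active.
    boolean isEmpty() {
        for (ModelJournal j : journals) {
            if (j.active) {
                return false;
            }
        }
        return true;
    }

    // The next sync treats an empty set as fatal.
    Outcome logSync() {
        return isEmpty() ? Outcome.TERMINATE : Outcome.OK;
    }
}
```

In this model, losing the last non-required journal only yields EXCEPTION from applyFailures(),
but the following logSync() returns TERMINATE, which is the safety net branch-0.23 lacks.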

In branch-0.23, FSEditLog maintains a list of journals. logSync() invokes isEmpty(), but that
does not check the validity of the journals in the list. Instead, logSync() checks them one by
one in a loop. Although it already has logic for counting and disabling bad journals, there is
nothing equivalent to the resource availability check in trunk/branch-2.  I think the best place
to add this is {{disableAndReportErrorOnJournals()}}. This will make the failure behavior almost
the same as what is already implemented in trunk/branch-2.
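
A minimal sketch of the kind of check proposed here, assuming a simplified model rather than
the real branch-0.23 classes. SketchJournal, SketchEditLog, countActive(), terminate() and the
method signature are hypothetical; only the name disableAndReportErrorOnJournals() comes from
the code discussed above. Termination is recorded in a flag instead of exiting the JVM.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a journal; not the real Hadoop class.
class SketchJournal {
    final String name;
    boolean disabled;

    SketchJournal(String name) {
        this.name = name;
    }
}

// Hypothetical model of an edit log that fails fast when no journal is left.
class SketchEditLog {
    private final List<SketchJournal> journals = new ArrayList<>();
    private boolean terminated = false;

    void addJournal(SketchJournal j) {
        journals.add(j);
    }

    // Stand-in for terminating the namenode; the sketch only records the event.
    private void terminate(String reason) {
        terminated = true;
    }

    boolean isTerminated() {
        return terminated;
    }

    // Counts journals that have not been disabled.
    int countActive() {
        int n = 0;
        for (SketchJournal j : journals) {
            if (!j.disabled) {
                n++;
            }
        }
        return n;
    }

    // Disables the failed journals, then treats "no active journal left"
    // as fatal instead of continuing to serve.
    void disableAndReportErrorOnJournals(List<SketchJournal> bad) {
        for (SketchJournal j : bad) {
            j.disabled = true;
        }
        if (countActive() == 0) {
            terminate("No journals available");
        }
    }
}
```

Placing the check where journals are disabled means every caller that reports bad journals,
including the roll path, gets the same fail-fast behavior without a separate check.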

This issue does not exist in branch-1, where rollEditLog() clears {{editStreams}} before creating
new edit files. Since it calls {{exitIfNoStreams()}} before returning, the namenode will terminate
if no edit stream was successfully created.

As for test cases, trunk already has TestEditLogJournalFailures.  I will create a new patch
for branch-0.23 and a test case.
                
> NN keeps serving even after no journals started while rolling edit
> ------------------------------------------------------------------
>
>                 Key: HDFS-4233
>                 URL: https://issues.apache.org/jira/browse/HDFS-4233
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 0.23.5
>            Reporter: Kihwal Lee
>            Priority: Blocker
>         Attachments: hdfs-4233-branch-0.23-quick-death.patch
>
>
> We've seen the namenode keep serving even after a rollEditLog() failure. Instead of taking
> corrective action or regarding this condition as FATAL, it keeps on serving and modifying its
> file system state. No logs are written from this point, so if the namenode is restarted, there
> will be data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
