hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aaron T. Myers (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2709) HA: Appropriately handle error conditions in EditLogTailer
Date Fri, 30 Dec 2011 01:01:41 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177483#comment-13177483

Aaron T. Myers commented on HDFS-2709:

Thanks a lot for the thorough review, Eli. Comments inline. I also found and fixed another
little bug involving a potential race between the edits log tailer thread and rolling edits
logs. I'll post an updated patch in a moment.

bq. This change handles errors reading an edit from the log (the common case) but not when
there's a failure to apply an edit (eg if there was a bug, or a silent corruption somehow
went unnoticed). While loadEdits won't ignore (will throw) this exception it does get propagated
up to the catch of Throwable in EditLogTailer#run so we effectively retry endlessly in this
case. Need to replace the TODO(HA) comment there with code to shutdown the SBN. Feel free
to punt to another jira.

Indeed, I had originally intended to do this as part of a separate JIRA, but I'm rethinking
that decision. I've added some code to shutdown the SBN, and amended the tests to verify this

bq. How about adding a test that uses multiple shared edits dirs, and shows that a failure
to read from one of them will cause the tailer to not catch up, can file a jira for a future
change that is OK with faulty shared dirs as long as one is working.

Multiple shared edits dirs isn't currently supported or tested. It's certainly an obvious
improvement worth doing, but there are currently no tests for it. We should probably file
a JIRA to test that.

bq. In FileJournalManager#getNumberOfTransactions, not that the we loosen the check to elf.containsTxId(fromTxid)
isn't the last else case dead code?

Yes indeed, not sure how I missed that. Removed.

bq. I think we can remove the "TODO(HA): Should this happen when called by the tailer?" comment
in loadEdits right since we always create new streams when we select them?

Yes indeed. Removed.

bq. Would it be simpler in LimitedEditLogAnswer#answer to spy on each stream and stub readOp
rather than introduce LimitedEditLogInputStream?

Different? Yes. Simpler? Maybe. I did it this way because I thought creating spies within
spies was kind of gross. I switched it to use a spy in this latest patch, which is at least
less code. :)

bq. How about introducing DFSHATestUtil and put waitForStandbyToCatchUp and CouldNotCatchUpException
there? Seems like the methods you pointed out in the HDFS-2692 review could go there as well).

Good idea. Let's do it in a separate JIRA though, along the lines of "consolidate generic
HA test helper methods."

bq. Nit: "IOException e", s/e/ioe/


bq. testFailuretoReadEdits needs a javadoc


bq. waitForStandbyToCatchUp needs a javadoc indicating it waits for NN_LAG_TIMEOUT then throws

> HA: Appropriately handle error conditions in EditLogTailer
> ----------------------------------------------------------
>                 Key: HDFS-2709
>                 URL: https://issues.apache.org/jira/browse/HDFS-2709
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Aaron T. Myers
>            Priority: Critical
>         Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch
> Currently if the edit log tailer experiences an error replaying edits in the middle of
a file, it will go back to retrying from the beginning of the file on the next tailing iteration.
This is incorrect since many of the edits will have already been replayed, and not all edits
are idempotent.
> Instead, we either need to (a) support reading from the middle of a finalized file (ie
skip those edits already applied), or (b) abort the standby if it hits an error while tailing.
If "a" isn't simple, let's do "b" for now and come back to 'a' later since this is a rare
circumstance and better to abort than be incorrect.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message