hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3726) QJM: if a logger misses an RPC, don't retry that logger until next segment
Date Wed, 05 Sep 2012 07:04:07 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448529#comment-13448529 ]

Todd Lipcon commented on HDFS-3726:
-----------------------------------

Been thinking about this a bit, and I think it might actually make sense to set the "outOfSync"
flag on any exception. For example, if the JN has gone down, the client will receive generic
IOExceptions. As soon as we miss any edit in the segment due to such an IOE, there's
no sense in sending more, since they will just fail with JournalOutOfSync.

If folks agree, I will make that change.
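Here is a minimal sketch of the proposed client-side behavior. It is written in Python for brevity, not in the Java of the real QJM code. `LoggerChannelSketch`, `send_rpc`, and `out_of_sync` are hypothetical names, not the actual QJM classes. The sketch only shows two things: any exception on a {{journal()}} call marks that logger out of sync, and the next {{startLogSegment}} clears the flag.

```python
class LoggerChannelSketch:
    """Hypothetical client-side proxy for one JournalNode (JN)."""

    def __init__(self, send_rpc):
        # send_rpc(method, *args) is an assumed transport callback that
        # raises an exception (e.g. IOError) when the call to the JN fails.
        self._send_rpc = send_rpc
        self.out_of_sync = False

    def start_log_segment(self, txid):
        # A new segment gives a lagging JN a clean starting point, so
        # clear the flag and start talking to this logger again.
        self.out_of_sync = False
        try:
            self._send_rpc("startLogSegment", txid)
        except Exception:
            self.out_of_sync = True
            raise

    def journal(self, first_txid, records):
        if self.out_of_sync:
            # This JN already missed an edit in the current segment, so
            # another journal() RPC would only fail; skip sending it.
            raise IOError("logger is out of sync until the next segment")
        try:
            self._send_rpc("journal", first_txid, records)
        except Exception:
            # Set the flag on *any* exception, not just the JN's specific
            # out-of-sync error: once one edit is missed, later ones are useless.
            self.out_of_sync = True
            raise
```

Raising locally, instead of silently dropping the call, means the skipped logger still counts as a failed response when the client checks for a quorum.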

I also want to make another small change at the same time: any time one of the JNs throws
an exception, we should log it, even if a majority succeeded. That will help cluster administrators
diagnose the case when one of the JNs has gone down or is having disk issues. Currently, these
types of errors are silent on the client side, which is not so good.
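The logging change could look roughly like the sketch below. `quorum_call` and its arguments are hypothetical. The real QJM sends the RPCs to all JNs in parallel, but this sketch calls them one after another to keep it short. Failures are logged before the quorum check, so a single bad JN shows up in the client log even when the write succeeds.

```python
import logging

LOG = logging.getLogger("qjm.sketch")


def quorum_call(loggers, call):
    """Run call(logger) against every JN proxy; succeed if a majority succeed.

    Hypothetical helper: `loggers` maps a JN name to its proxy, and `call`
    performs the RPC on one proxy, raising an exception on failure.
    """
    successes, failures = {}, {}
    for name, logger in loggers.items():
        try:
            successes[name] = call(logger)
        except Exception as e:
            failures[name] = e

    # Log every failure, even if a quorum succeeded, so that a JN that is
    # down or has a bad disk is visible to administrators on the client side.
    for name, e in failures.items():
        LOG.warning("Remote journal %s failed: %r", name, e)

    majority = len(loggers) // 2 + 1
    if len(successes) < majority:
        raise IOError(
            f"only {len(successes)}/{len(loggers)} journals succeeded; "
            f"need {majority}")
    return successes
```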
                
> QJM: if a logger misses an RPC, don't retry that logger until next segment
> --------------------------------------------------------------------------
>
>                 Key: HDFS-3726
>                 URL: https://issues.apache.org/jira/browse/HDFS-3726
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha
>    Affects Versions: QuorumJournalManager (HDFS-3077)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-3726.txt
>
>
> Currently, if a logger misses an RPC in the middle of a log segment, or misses the {{startLogSegment}}
> RPC (e.g. it was down or the network was disconnected during that time period), then it will throw
> an exception on every subsequent {{journal()}} call in that segment, since it knows that it
> missed some edits in the middle.
> We should change this exception to a specific IOE subclass, and have the client side
> of QJM detect the situation and stop sending IPCs until the next {{startLogSegment}} call.
> This isn't critical for correctness but will help reduce log spew on both sides.
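On the JN side, a logger can tell that it has missed edits by tracking the next transaction ID it expects in the open segment. It then rejects any {{journal()}} call that doesn't start at that ID. The sketch below shows this approach. `JournalSketch` and `JournalOutOfSyncError` are hypothetical names; the exception is modeled on the JournalOutOfSync error mentioned in the comment above. The exception subclasses `IOError`, matching the IOE subclass the description proposes.

```python
class JournalOutOfSyncError(IOError):
    """Hypothetical IOError subclass: this JN missed edits in the current segment."""


class JournalSketch:
    """Hypothetical JN-side state for one log segment."""

    def __init__(self):
        self.next_txid = None  # None means no segment is open on this JN

    def start_log_segment(self, txid):
        self.next_txid = txid

    def journal(self, first_txid, records):
        if self.next_txid is None or first_txid != self.next_txid:
            # Missed startLogSegment or an earlier journal() call, so every
            # later call in this segment fails the same way.
            raise JournalOutOfSyncError(
                f"expected txid {self.next_txid}, got {first_txid}")
        # Durably writing the records is omitted from this sketch.
        self.next_txid += len(records)
```

Because the error has its own subclass, a client could recognize it with `isinstance(e, JournalOutOfSyncError)`. Under the comment's proposal, though, the client would set its flag on any exception anyway.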

