hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs
Date Thu, 27 Sep 2012 22:21:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465164#comment-13465164

Todd Lipcon commented on HDFS-3077:

bq. "Henceforth we will refer to these nodes as replicas." Please use a different term as
replicas is heavily used in the context of block replica in HDFS. Perhaps Journal Replicas
may be a better name.


bq. "Before taking action in response to any RPC, the JournalNode checks the requester's epoch
number against its lastPromisedEpoch variable. If the requester's epoch is lower, then it
will reject the request". This is only true for all the RPCs other than newEpoch. Further
it should say if the requester's epoch is not equal to lastPromisedEpoch the request is rejected.

Fixed. Actually, if any request comes with a higher epoch than lastPromisedEpoch, then the
JN accepts the request and also updates lastPromisedEpoch. This allows a JournalNode to join
back into the quorum properly even if it was down when the writer became active.

bq. In step 3, you mean newEpoch is sent to "JNs" and not QJMs. Rest of the description should
also read "JNs" instead of "QJMs".

Thanks, fixed.

bq. In step 4. "Otherwise, it aborts the attempt to become the active writer." What is the
state of QJM after this at the namenode? More details needed.

 Otherwise, it aborts the attempt to become the active writer by throwing
 an IOException. This will be handled by the NameNode in the same fashion as a failure to
 to an NFS mount -- if the QJM is being used as a shared edits volume, it will cause the NN

bq. Section 2.6, bullet 3 - is synchronization on quorum nodes done for only the last segments
or all the segments (required for a given fsimage?). Based on the answer, section 2.8 might
require updates.

It only synchronizes the last log segment. Any earlier segments are already guaranteed to
be finalized on a quorum of nodes (either by a postcondition of the recovery process, or by
the fact that a new segment is not started by a writer until the previous one is finalized
on a quorum).

In the future, we might have a background thread synchronizing earlier log segments to "minority
JNs" who were down when they were written, but we already have a guarantee that a quorum has
every segment.

bq. Say a new JN is added or an older JN came backup during restart of the cluster. I think
you may achieve quorum without the overlap of a node that was part of previous quorum write.
This could result in loading stale journal. How do we handle this? Is set of JNs that the
system was configured/working with?

The JNs don't auto-format themselves, so if you bring up a new one with no data, or otherwise
end up contacting one that wasn't part of the previous quorum, then it won't be able to respond
to the newEpoch() call. It will throw a "JournalNotFormattedException".

As for adding new journals, the process today would be:
a) shut down HDFS cleanly
b) rsync one of the JN directories to the new nodes
c) add new nodes to the qjournal URI
d) restart HDFS

As I understand it, this is how new nodes are added to ZooKeeper quorums as well. In the future
we might add a feature to help admins with this, but it's a really rare circumstance, so I
think it's better to eschew the complexity in the initial release. (ZK itself is only adding
online quorum reconfiguration now)

The JNs also keep track of the namespace ID and will reject requests from a writer if his
nsid doesn't match, which prevents accidental "mixing" of nodes between clusters.

bq. What is the effect of newEpoch from another writer on a JournalNode that is performing
recovery, especially when it is performing AcceptRecovery? It would be good to cover what
happens in other states as well.

Since all of the RPC calls are synchronized, there are no race conditions _during_ the RPC.
If a new writer performs newEpoch before acceptRecovery, then the acceptRecovery call will
fail. If the new writer performs newEpoch after acceptRecovery, then the new one will get
word of the previous writer's recovery proposal when it calls prepareRecovery().

This part follows Paxos pretty closely, and I didn't want to digress too much into explaining
Paxos in the design doc. I'm happy to add an Appendix with a couple of these examples, though,
if you think that would be useful.

bq. In "Prepare Recovery RPC", how does writer use previously accepted recovery proposal?

Per above, this follows Paxos. If there are previously accepted proposals, then the new writer
chooses them preferentially even if there are other segments which might be longer -- see
section 3.9 point 3.a.

bq. Does accept recovery wait till journal segments are downloaded? How does the timeout work
for this?

Yep, it downloads the new segment, then atomically renames the segment from its temporary
location and records the accepted recovery. The timeout here is the same as "dfs.image.transfer.timeout"
(default 60sec). If it times out, then it will throw an exception and not accept the recovery.
If the writer performing recovery doesn't succeed on a majority of nodes, then it will fail
at this step.

bq. Section 2.9 - "For each logger, calculate maxSeenEpoch as the greater of that logger's
lastWriterEpoch and the epoch number corresponding to any previously accepted recovery proposal."
Can you explain in section 2.10.6 why previously accepted recovery proposal needs to be considered?

This is necessary in case a writer fails in the middle of recovery. Here's an example, which
I'll also add to the design doc:

Assume we have failed with the three JNs at different lengths, as in Example 2.10.2:

|| JN  || segment || last txid || acceptedInEpoch || lastWriterEpoch ||
| JN1 | edits_inprogress_101 | 150 | - | 1 |
| JN2 | edits_inprogress_101 | 153 | - | 1 |
| JN3 | edits_inprogress_101 | 125 | - | 1 |

Now assume that the first recovery attempt only contacts JN1 and JN3. It decides that length
150 is the
correct recovery length, and calls {{acceptRecovery(150)}} on JN1 and JN3, followed by
{{ finalizeLogSegment(101-150) }}. But, it crashes before the {{finalizeLogSegment}} call
reaches JN1.
The state now is:

|| JN  || segment || last txid || acceptedInEpoch || lastWriterEpoch ||
|JN1 | edits_inprogress_101 | 150 | 2 | 1 |
|JN2 | edits_inprogress_101 | 153 | - | 1 |
|JN3 | edits_101-150 | 150 | - | 1 |

When a new NN now begins recovery, assume it talks only to JN1 and JN2. If it did not consider
{{acceptedInEpoch}}, it would incorrectly decide to finalize to txid 153, which would break
the invariant that finalized log segments beginning at the same transaction ID must have the
same length. Because of Rule 3.b, it will instead choose JN1 again as the
recovery master, and properly finalize JN1 and JN2 to txid 150 instead of 153, which match
the now-crashed JN3.

bq. Section 3 - since a reader can read from any JN, if the JN it is reading from gets disconnected
from active, does the reader know about it? How does this work especially in the context of
standby namenode?

Though the SBN can read each segment from any one of the JNs, it actually sends the "getEditLogManifest"
to _all_ of the JNs. Then, it takes the results, and merges them using RedundantEditLogInputStream.
So, if two JNs are up which have a certain segment, then both are available for reading. If,
then, in the middle of the read, it crashes, the SBN can "fail over" to reading from the other
JN that had this same segment.

bq. Following additional things would be good to cover in the design:
bq. Cover boot strapping of JournalNode and how it is formatted

Added a section to the design doc on bootstrap and format

bq. Section 2.8 "replacing any current copy of the log segment". Need more details here. Is
it possible that we delete a segment and due to correlated failures, we lose the journal data
in the process. So replacing must perhaps keep the old log segment until the segment recovery

Can you give a specific example of what you mean here? We don't delete the existing segment
except when we are moving a new one on top of it -- and the new one has already been determined
to be a "valid recovery". The download process via HTTP also uses FileChannel.force() after
downloading to be sure that the new file is fully on disk before it is moved into place.

bq. How is addition, deletion and JN becoming live again from the previous state of dead/very
slow handled?

On each segment roll, the client will again retry writing to all of the JNs, even those that
had been marked "out of sync" during the previous log segment. If it's just lagging a bit,
then the queueing in the IPCLoggerChannel handles that (it'll start the new segment a bit
behind the other nodes, but that's fine). Is there a psecific example I can explain that would
make this clearer?

bq. I am still concerned (see my previous comments about epochs using JNs) that a NN that
does not hold the ZK lock can still cause service interruption. This is could be considered
later as an enhancement. This probably is a bigger discussion.

Yea, I agree this is worth a separate discussion. There's no real way to tie a ZK lock to
anything except for ZK data - you can always think you have the lock, but by the time you
take action, not have it anymore.

bq. I saw couple of white space/empty line changes

Will take care of these, sorry.

bq. Also moving some of the documentation around can be done in trunk, or that particular
change can be merged to trunk to keep this patch smaller.

It seems wrong to merge the docs change to trunk when the code it's documenting isn't there,
yet. Aaron posted some helpful diffs with the docs on HDFS-3926if you want to review the diff
without all the extra diff caused by the moving.

bq. An additional comment - in 3092 design during recovery we had just fence (newEpoch() here)
and roll. I am not sure why recovery needs to have so many steps - prepare, accept and roll.
Can you please describe what I am missing?

I think some of the above comments may explain this - in particular the reason why you need
the idea of accepting recovery prior to committing it. Otherwise, I'll turn the question on
its head: why do you think you can get away with so few steps? Perhaps it's possible in a
system that requires every write to go
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>         Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt, hdfs-3077.txt,
hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, qjournal-design.tex
> Currently, one of the weak points of the HA design is that it relies on shared storage
such as an NFS filer for the shared edit log. One alternative that has been proposed is to
depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated
edit log on commodity hardware. This JIRA is to implement another alternative, based on a
quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only
by HDFS's needs rather than more generic use cases. More details to follow.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message