hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs
Date Wed, 10 Oct 2012 17:29:08 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473384#comment-13473384 ]

Todd Lipcon commented on HDFS-3077:

bq. I did not understand this well. Why are we retrying any request to JournalNodes? Given
most of the requests are not idempotent and cannot be retried why is this an advantage?

Currently we don't retry most requests, but it would actually be easy to support retries in
most cases, because the client always sends a unique <epoch, ipc serial number> in each
call. If the server receives the same epoch and serial number twice in a row, it can safely
re-respond with the previous response. This is not true for the newEpoch calls, because this
is where we _enforce_ the unique epoch ID.
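
To make the retry idea concrete, here is a minimal sketch of how a server could re-respond to a duplicate call. It is not the actual JournalNode code: the class name RetryCacheSketch, the String request/response types, and the executions counter are all hypothetical, and it assumes a single writer per server. newEpoch is deliberately left out, since that call is where the unique epoch is enforced.

```java
/** Hypothetical sketch: remember the last call and re-send its response on a retry. */
final class RetryCacheSketch {
    private long lastEpoch = -1;
    private long lastSerial = -1;
    private String lastResponse = null;
    private int executions = 0; // how many times the real work actually ran

    /** Handles one call identified by the client's (epoch, ipc serial number) pair. */
    synchronized String handle(long epoch, long serial, String request) {
        if (epoch == lastEpoch && serial == lastSerial) {
            // Same epoch and serial number twice in a row: this is a retry,
            // so return the previous response without applying the request again.
            return lastResponse;
        }
        String response = apply(request);
        lastEpoch = epoch;
        lastSerial = serial;
        lastResponse = response;
        return response;
    }

    /** Stand-in for the non-idempotent work (assumption: any side effect would go here). */
    private String apply(String request) {
        executions++;
        return "ack:" + request + "#" + executions;
    }

    int executions() {
        return executions;
    }
}
```

With this shape, a client that times out and resends the same <epoch, serial> gets the original response back, and the non-idempotent work runs only once.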

As for _why_ we'd want to retry, it seems useful to be able to do so after a small network
blip between NN and JNs, for example.

bq. Recover all transactions. We do this in the same fashion as ZAB rather than the way you suggested
"the NN can run the recovery process for each of these earlier segments individually". Note
this requires two changes:
- The protocol message contents change - the response to phase 1 is highest txid and not highest
segment's txid. The JN recovers all previous transactions from another JN.
- When a JN joins an existing writer it first syncs previous segments.

I'll agree to work on the improvement where we recover previous segments, but I disagree that
it should be done as part of the recovery phase. Here are my reasons:
- Currently with local journals, we don't do this. When an edits directory crashes and becomes
available again, we start writing to it without first re-copying all previous segments back
into it. This has worked fine for us. So I don't think we need a stronger guarantee on the
- The NN may maintain a few GB worth of edits due to various retention policies. If a JN crashes
and is reformatted, then this would imply that the JN has to copy multiple GB worth of data
from another JN before it can actively start participating as a destination for new logs.
This will take quite some time.
- Furthermore, because the JN will be syncing its logs from another JN, we need to make sure
the copying is throttled. Otherwise, the restart of a JN will suck up disk and network bandwidth
from the other JN which is trying to provide low latency logging for the active namenode.
If we didn't throttle it, the transfer and disk IO would balloon the latency for the namespace
significantly, which I think is best avoided. If we do throttle it (say to 10MB/sec), then
syncing several GB of logs will take several minutes, during which time the fault tolerance
of the cluster is compromised.
- Similar to the above, if there are several GB of logs to synchronize, this will impact NN
startup (or failover) time a lot.
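
As a rough illustration of the throttling arithmetic above, the following sketch estimates how long a throttled copy would take. The 3 GB backlog is an assumed figure chosen for the example (the comment above only says "several GB"), and the class and method names are hypothetical.

```java
/** Hypothetical back-of-the-envelope estimate for a throttled segment copy. */
final class ThrottleEstimate {
    /** Seconds needed to copy totalBytes at a fixed rate, rounded up. */
    static long secondsToSync(long totalBytes, long bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        return (totalBytes + bytesPerSecond - 1) / bytesPerSecond; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long backlog = 3L * 1024L * mb; // assumption: 3 GB of retained edits
        long rate = 10L * mb;           // the 10MB/sec throttle mentioned above
        long secs = secondsToSync(backlog, rate);
        System.out.println(secs + " seconds, roughly " + (secs / 60) + " minutes");
    }
}
```

Under these assumptions the copy takes a little over five minutes, which is time the JN would spend unable to accept new logs if the sync were part of the recovery phase.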

I think, instead, the synchronization should be done as a background thread:
- The thread periodically wakes up and scans for any in-progress segments or gaps in the local
- If it finds one (and it is not the highest numbered segment), then it starts the synchronization
-- We reuse existing code to RPC to the other nodes and find one which has the finalized segment,
and copy it using the transfer throttler to avoid impacting latency of either the sending
or receiving node.
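
The scan step of that background thread could look something like the sketch below. This is only an illustration under stated assumptions: Segment and findRangesToSync are hypothetical names, a stale in-progress segment is treated as if it were missing, and the RPC lookup and throttled copy are left out entirely.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the periodic scan for in-progress segments and gaps. */
final class SegmentScanSketch {
    /** A local edit log segment; lastTxId is ignored while inProgress is true. */
    static final class Segment {
        final long firstTxId;
        final long lastTxId;
        final boolean inProgress;

        Segment(long firstTxId, long lastTxId, boolean inProgress) {
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
            this.inProgress = inProgress;
        }
    }

    /** Returns inclusive {firstTxId, lastTxId} ranges that should be copied from another node. */
    static List<long[]> findRangesToSync(List<Segment> local) {
        List<Segment> sorted = new ArrayList<>(local);
        sorted.sort(Comparator.comparingLong((Segment s) -> s.firstTxId));
        List<long[]> ranges = new ArrayList<>();
        if (sorted.isEmpty()) {
            return ranges;
        }
        long nextExpected = sorted.get(0).firstTxId;
        for (Segment s : sorted) {
            if (s.firstTxId > nextExpected) {
                // Missing transactions (or a stale in-progress segment) before this one.
                ranges.add(new long[] {nextExpected, s.firstTxId - 1});
                nextExpected = s.firstTxId;
            }
            if (s.inProgress) {
                // Not finalized: nextExpected is not advanced, so the range is reported
                // once a later segment shows where it ends. The highest-numbered
                // segment has no later segment and is therefore never reported.
                continue;
            }
            nextExpected = Math.max(nextExpected, s.lastTxId + 1);
        }
        return ranges;
    }
}
```

For example, with finalized segments 1-100 and 201-300, a stale in-progress segment starting at 101, and the highest segment in progress from 401, the scan reports 101-200 and 301-400 and leaves the highest segment alone.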

bq. Lets merge the newEpoch and prepareRecovery. Given that this works for ZAB I still fail
to see why it cannot work for us. I think because of (1), merging the two steps will no longer
be an issue.

I still don't understand why this is _better_ and not just _different_. If you and Suresh
want to make the change, it's OK by me, but I expect that you will re-run the same validation
before committing (e.g., run the randomized fault test a few hundred thousand times). This testing
found a bunch of errors in the design before, so any change to the design should go through
the same test regimen to make sure we aren't missing some subtlety.

If the above sounds good to you, let's file the follow-up JIRAs and merge? Thanks.
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>         Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt, hdfs-3077.txt,
hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf,
qjournal-design.pdf, qjournal-design.tex, qjournal-design.tex
> Currently, one of the weak points of the HA design is that it relies on shared storage
such as an NFS filer for the shared edit log. One alternative that has been proposed is to
depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated
edit log on commodity hardware. This JIRA is to implement another alternative, based on a
quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only
by HDFS's needs rather than more generic use cases. More details to follow.
