hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs
Date Mon, 08 Oct 2012 18:56:03 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471762#comment-13471762

Todd Lipcon commented on HDFS-3077:

bq. the QJM is not replaceable by local disk journal if QJM is not available because the local
disk journal and qjm will not be consistent

In a shared-edits setup, the shared edits are always synced ahead of the local disk journals.
This means that anything that's committed locally will also be in the QJM. If the QJM fails
to sync, then the NN aborts (since the shared edits are marked as "required"). So, they're
not "consistent" but you can always take finalized edits from a JN and copy them to a local
drive. In the case of some disaster, you can also freely intermingle the files - having that
flexibility without having to hex-edit out a header seems prudent.

More importantly, IMO, you can continue to run tools like the OfflineEditsViewer against edits
stored on a JN.

bq. We may have to revise the journal abstraction at little to deal with the above situation
(independent of storing epoch in the first entry) since a QJM+localDisk journal is useful.

This is in fact the way in which we've been doing all of our HA testing (and now some customer
deploys). We use local FileJournalManagers on each NN, and the QuorumJournalManager as shared
edits. Per above, this works fine and doesn't have any "lost edit" issues.

Can you be specific about the consistency issue you're foreseeing here?

bq. Todd the change is small and i am trying to help you here.

Maybe I'm mis-understanding the change. More below on this...

bq. Recall in HDFS-1073 you did not want to use transaction ids or name the log files using
transaction id range and you argued against this for quite a while. As I predicted, txids
have become a cornerstone of HA and managing journals

To be clear, the 1073 design was always using transaction IDs, it was just a matter of the
file naming that we argued about. But I don't think it's productive to argue about the past

bq. You have argued that prepare-recovery, using the epoch number from previous newEpoch,
is like multi-paxos - not sure if multi-paxos is warranted here.

Can you explain what you mean by "not sure if multi-paxos is warranted here?" I just meant
that, similar to multi-paxos, you can use an earlier promise to verify all future messages
against that earlier epoch ID. Otherwise each operation would require its own new epoch, and
that's clearly not what we want.

bq. The response of the newEpoch() is highest txId while response to PrepareRecovery is the
state of the highest segment, and optionally additional info if there was a previous recovery
in progress

I think there is some confusion here. The response to newEpoch is the highest segment txid,
but the highest segment txid may not match up across all of the JNs. On a 3-JN setup, you
may have three different responses to NewEpoch. For example, the following scenario:

1. NN writing segment starting at txid 1
2. JN1 crashes after txid 50
3. NN successfully rolls, starts txid 101
4. NN successfully finalizes segment 101-200. Starts segment 201
5. NN sends txid 201 to JN2 and JN3, but JN3 crashes before receiving it.
6. Everyone shuts down and restarts.

The current state is then:
JN1: highestSegmentTxId = 1, since it had crashed during that segment
JN2: highestSegmentTxId = 101, since it never got any transactions for segment 201
JN3: highestSegmentTxId = 201, since it got one transaction for segment 201

Then, depending on which JNs are available during recovery, the prepareRecovery() call is
different, and thus, we'd need different responses. It's not really simple to piggy-back the
segment info on NewEpoch, because we don't yet know which segment is the one to be recovered
(it may be some segment that is only available on one live node)

Am I misunderstanding the concrete change that you're proposing? Maybe you can post a patch?

bq. In step 3b you state that recovery metadata is created and then deleted in step 4. Isn't
the updated journal file sufficient? In paxos when phase 2 is completed, paxos protocol has
essentially completed when quorum number of Journal have learned the new value. From what
i understand, even in ZAB the journal is updated at that stage and no separate metadata is

The updated journal file isn't sufficient because it doesn't record information about whether
it was an accepted recovery proposal or whether it was just left over at the last write. You
need to ensure the property that, if the recovery coordinator thinks a value is accepted,
then no different recovery will be accepted in the future (otherwise you risk having two different
finalized lengths for the same log segment). In order to do so, you need to wait until a quorum
of nodes are Finalized before you know that any future recovery will be able to rely only
on the finalization state.

I don't know enough about the details of the ZAB implementation to understand why they can
get away without this, if in fact they can. My guess is that it's because the transaction
IDs themselves have the epoch number as their high order bits, and hence you can't ever confuse
the first txn of epoch N+1 with the last transaction of epoch N.

bq. The final step (finalize-segment or ZAB's commit) is really to lets all the JNs know that
the new writer is the leader and that they can publish the data to other readers (the standBy
in our case).

Agreed. At this point we delete the metadata for recovery, but we don't necessarily have to.
It's just a convenient place to do the cleanup.
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>         Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt, hdfs-3077.txt,
hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf,
qjournal-design.pdf, qjournal-design.tex, qjournal-design.tex
> Currently, one of the weak points of the HA design is that it relies on shared storage
such as an NFS filer for the shared edit log. One alternative that has been proposed is to
depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated
edit log on commodity hardware. This JIRA is to implement another alternative, based on a
quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only
by HDFS's needs rather than more generic use cases. More details to follow.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message