zookeeper-dev mailing list archives

From "Camille Fournier (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-962) leader/follower coherence issue when follower is receiving a DIFF
Date Mon, 10 Jan 2011 16:24:45 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979623#action_12979623 ]

Camille Fournier commented on ZOOKEEPER-962:

I made a reviewboard for this myself actually, https://reviews.apache.org/r/253/, I guess
it didn't get emailed to the group properly? I'll cancel it and use yours since we've already
got comments in there.

Let me see if I understand what you're getting at:
We see snap to Z0. We start to see the toBeApplied proposals being forwarded, and we write
them to the log, but before we get the UPTODATE (and thus write the snap file to disk), we
crash. So we have a log file with Z1, but no snap file with the older data.

Yes, you're right, that's a nasty problem. 
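
To make the failure concrete, here is a minimal sketch of that crash window. It is a toy model, not ZooKeeper code: the class and method names are hypothetical, zxids are treated as consecutive integers, and the older on-disk snapshot at zxid 50 is an assumption made only so that recovery has something to start from.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the follower's disk state during sync; not ZooKeeper code. */
final class SnapCrashSketch {

    /**
     * Returns true if, after a restart, the on-disk log starts beyond the
     * on-disk snapshot, i.e. some transactions can never be replayed.
     */
    static boolean diskStateHasGap(boolean crashBeforeUpToDate) {
        long snapshotOnDisk = 50;          // assumed older snapshot already on disk
        List<Long> logOnDisk = new ArrayList<>();

        long z0 = 100;                     // SNAP from the leader, held in memory only
        logOnDisk.add(z0 + 1);             // Z1 is forwarded and written to the log

        if (!crashBeforeUpToDate) {
            snapshotOnDisk = z0;           // UPTODATE arrives, snapshot reaches disk
        }

        // Recovery replays the log on top of the newest snapshot on disk.
        long firstLogged = logOnDisk.get(0);
        return firstLogged > snapshotOnDisk + 1;
    }

    public static void main(String[] args) {
        System.out.println(diskStateHasGap(true));   // crash before UPTODATE
        System.out.println(diskStateHasGap(false));  // normal sync
    }
}
```

In the crash case the log claims Z1 while the newest snapshot on disk predates Z0, so everything that arrived only through the SNAP is lost on restart.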

Are you proposing that, in all syncWithLeader scenarios, we process every packet up to
UPTODATE inside syncWithLeader, or only in the snap case?

I think I can twiddle my test to catch this error, let me look at it. I'm not sure how much
time I will have to actually make the fix you are proposing, though, so if you have time to
try to add it onto my patch please let me know.

> leader/follower coherence issue when follower is receiving a DIFF
> -----------------------------------------------------------------
>                 Key: ZOOKEEPER-962
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-962
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.2
>            Reporter: Camille Fournier
>            Assignee: Camille Fournier
>            Priority: Critical
>             Fix For: 3.3.3, 3.4.0
>         Attachments: ZOOKEEPER-962.patch
> From mailing list:
> It seems like we rely on the LearnerHandler thread startup to capture all of the missing
> transactions in the SNAP or DIFF, but I don't see anything (especially in the DIFF case)
> that is preventing us from committing more transactions before we actually start forwarding
> updates to the new follower.
> Let me explain using my example from ZOOKEEPER-919. Assume we have quorum already, so
> leader can be processing transactions while my follower is starting up.
> I'm a follower at zxid N-5, the leader is at N. I send my FOLLOWERINFO packet to the leader
> with that information. The leader gets the proposals from its committed log (time T1), and
> syncs on the proposal list (LearnerHandler line 267. Why? It's a copy of the underlying
> list... this might be part of our problem). I check to see if the peerLastZxid is within
> max and min committed log and it is, so I'm going to send a diff. I set the zxidToSend to
> be the maxCommittedLog at time T3 (we already know this is sketchy), and forward the proposals
> from my copied proposal list starting at the peerLastZxid+1 up to the last proposal transaction
> (as seen at time T1).
> After I have queued up all those diffs to send, I tell the leader to startForwarding updates
> to this follower (line 308).
> So, let's say that at time T2 I actually swap out the leader to the thread that is handling
> the various request processors, and see that I got enough votes to commit zxid N+1. I commit
> N+1 and so my maxCommittedLog at T3 is N+1, but this proposal is not in the list of proposals
> that I got back at time T1, so I don't forward this diff to the follower. Additionally, I processed
> the commit and removed it from my leader's toBeApplied list. So when I call startForwarding
> for this new follower, I don't see this transaction as a transaction to be forwarded.

> There's one problem. Let's also imagine, however, that I commit N+1 at time T4. The maxCommittedLog
> value is consistent with the max of the diff packets I am going to send the follower.
> I still committed N+1 and removed it from the toBeApplied list before calling startForwarding
> with this follower. How does the follower get this transaction? Does it?
> To put it another way, here is the thread interaction, hopefully formatted so you can
> read it...
>             LearnerHandlerThread              RequestProcessorThread
> T1(LH):     get list of proposals (COPY)
> T2(RPT):                                      commit N+1, remove from toBeApplied
> T3(LH):     get maxCommittedLog
> T4(LH):     send diffs from view at T1
> T5(LH):     startForwarding
> Or
> T1(LH):     get list of proposals (COPY)
> T2(LH):     get maxCommittedLog
> T3(RPT):                                      commit N+1, remove from toBeApplied
> T4(LH):     send diffs from view at T1
> T5(LH):     startForwarding
> I'm trying to figure out what, if anything, keeps requests from being committed during
> this window and then never seen by the follower before it fully starts up.
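
The interleavings above can be replayed deterministically with a small single-threaded sketch. This is a toy model under stated assumptions, not the LearnerHandler code: the class, the method names, and the use of plain lists for the committed log and toBeApplied are all hypothetical, and a commit is modeled as being applied (and so removed from toBeApplied) immediately.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy replay of the LearnerHandler / RequestProcessor interleaving; not ZooKeeper code. */
final class DiffRaceSketch {

    /** Returns the committed zxids that the new follower never receives. */
    static List<Long> missedZxids(long peerLastZxid, long leaderZxid, boolean commitAfterCopy) {
        List<Long> committedLog = new ArrayList<>();
        for (long z = 1; z <= leaderZxid; z++) {
            committedLog.add(z);
        }
        List<Long> toBeApplied = new ArrayList<>();
        long next = leaderZxid + 1;                   // the N+1 transaction

        if (!commitAfterCopy) {
            commit(committedLog, toBeApplied, next);  // commit lands before T1
        }

        // T1 (LH): take a COPY of the committed proposals.
        List<Long> copy = new ArrayList<>(committedLog);

        if (commitAfterCopy) {
            commit(committedLog, toBeApplied, next);  // RPT commits N+1 after the copy
        }

        // T4 (LH): queue diffs from the view taken at T1.
        List<Long> sent = new ArrayList<>();
        for (long z : copy) {
            if (z > peerLastZxid) {
                sent.add(z);
            }
        }
        // T5 (LH): startForwarding only sees what is still in toBeApplied.
        sent.addAll(toBeApplied);

        List<Long> missed = new ArrayList<>();
        for (long z : committedLog) {
            if (z > peerLastZxid && !sent.contains(z)) {
                missed.add(z);
            }
        }
        return missed;
    }

    /** Models a commit that is applied right away, so it leaves toBeApplied. */
    private static void commit(List<Long> committedLog, List<Long> toBeApplied, long zxid) {
        toBeApplied.add(zxid);
        committedLog.add(zxid);
        toBeApplied.remove(Long.valueOf(zxid));
    }

    public static void main(String[] args) {
        System.out.println(missedZxids(5, 10, true));   // prints [11]
        System.out.println(missedZxids(5, 10, false));  // prints []
    }
}
```

With the follower at N-5 = 5 and the leader at N = 10, a commit of N+1 that lands after the copy but before startForwarding is in neither the diff nor toBeApplied, which is the gap the issue describes.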

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
