hadoop-zookeeper-dev mailing list archives

From Vishal K <vishalm...@gmail.com>
Subject Re: [jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
Date Fri, 18 Jun 2010 23:45:19 GMT
Hi Flavio,

I have three sets of logs, and they all seem to indicate two problems on the
misbehaving follower:

Problem 1: Expected zxid is incorrect
=0    [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x300000002 expected
0x1
=0    [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x300000002 expected
0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x400000001 expected
0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x400000001 expected
0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x500000001 expected
0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x500000001 expected
0x1
=0    [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x600000001 expected
0x1
=0    [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x600000001 expected
0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x700000001 expected
0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x700000001 expected
0x1

Note that the expected zxid is always 0x1 (is lastQueued always 0?)
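
For reference while reading these lines: a zxid is a 64-bit number with the
leader epoch in the high 32 bits and a per-epoch counter in the low 32 bits,
which is why Flavio writes 0x300000001 as <3,1>. Below is a minimal sketch of
that split and of the "expected" value, assuming the follower expects
lastQueued + 1. ZxidSketch and its methods are hypothetical names used only
for illustration; this is not ZooKeeper's code.

```java
// Hypothetical helper for reading the log lines above; not ZooKeeper code.
final class ZxidSketch {
    // High 32 bits of a zxid: the leader epoch.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Low 32 bits of a zxid: the counter within that epoch.
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Renders a zxid in the <epoch,counter> notation used in this thread.
    static String format(long zxid) {
        return "<" + epochOf(zxid) + "," + counterOf(zxid) + ">";
    }

    // Assumption: the follower expects the zxid right after the last one it queued.
    static long expectedNext(long lastQueued) {
        return lastQueued + 1;
    }

    public static void main(String[] args) {
        // zxids taken from the WARN lines above.
        long[] got = {0x300000002L, 0x400000001L, 0x500000001L,
                      0x600000001L, 0x700000001L};
        long lastQueued = 0L; // assumption: never advanced on the bad follower
        for (long zxid : got) {
            System.out.println("Got " + format(zxid)
                    + " expected " + format(expectedNext(lastQueued)));
        }
    }
}
```

Under that assumption, an expected value of 0x1 corresponds to lastQueued == 0,
i.e. <0,1>, whatever the epoch of the incoming proposal.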

Problem 2: While joining the cluster, the expected epoch is 1 higher than the
epoch seen earlier
=14991 [QuorumPeer:/0.0.0.0:2181] FATAL
org.apache.zookeeper.server.quorum.Learner  - Leader epoch 7 is less than
our epoch 8
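
The FATAL line can be read as a comparison of two epochs. A rough sketch,
assuming the follower takes the leader's epoch from the high 32 bits of the
leader's zxid and compares it with the epoch of its own last logged zxid.
EpochCheckSketch, leaderIsStale, and the sample zxids are made up for
illustration; they are not taken from the logs or from ZooKeeper's source.

```java
// Hypothetical sketch of an epoch comparison; not ZooKeeper code.
final class EpochCheckSketch {
    // High 32 bits of a zxid: the leader epoch.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // True when the leader's epoch is behind the follower's own epoch.
    static boolean leaderIsStale(long leaderZxid, long ourLastZxid) {
        return epochOf(leaderZxid) < epochOf(ourLastZxid);
    }

    public static void main(String[] args) {
        long leaderZxid = 0x700000000L;  // hypothetical: leader in epoch 7
        long ourLastZxid = 0x800000000L; // hypothetical: follower believes epoch 8
        if (leaderIsStale(leaderZxid, ourLastZxid)) {
            System.out.println("Leader epoch " + epochOf(leaderZxid)
                    + " is less than our epoch " + epochOf(ourLastZxid));
        }
    }
}
```

If the check works roughly like this, the open question is how the follower
ended up with an epoch one higher than any it was sent.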

-Vishal

On Fri, Jun 18, 2010 at 6:33 PM, Vishal K <vishalmlst@gmail.com> wrote:

>
> Never mind. I am on the wrong track. Flavio's earlier mail did clarify that
> the follower received the epoch before restart.
>
>
> On Fri, Jun 18, 2010 at 6:20 PM, Vishal K <vishalmlst@gmail.com> wrote:
>
>> I might be wrong here, but let me try to chip in my few cents.
>>
>> I think the problem is in LearnerHandler.java at the leader for this
>> Follower.
>>
>>             /* see what other packets from the proposal
>>              * and tobeapplied queues need to be sent
>>              * and then decide if we can just send a DIFF
>>              * or we actually need to send the whole snapshot
>>              */
>>             long leaderLastZxid = leader.startForwarding(this, updates);
>> ---> this leaderLastZxid returned is probably incorrect.
>>             // a special case when both the ids are the same
>>             if (peerLastZxid == leaderLastZxid) {
>>                 packetToSend = Leader.DIFF;
>>                 zxidToSend = leaderLastZxid;
>>             }
>>
>>             QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
>>                     leaderLastZxid, null, null);
>>             oa.writeRecord(newLeaderQP, "packet");
>>             bufferedOutput.flush();
>>
>>
>>
>> On Fri, Jun 18, 2010 at 4:49 PM, Flavio Paiva Junqueira (JIRA) <
>> jira@apache.org> wrote:
>>
>>>
>>>    [
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880320#action_12880320]
>>>
>>> Flavio Paiva Junqueira commented on ZOOKEEPER-335:
>>> --------------------------------------------------
>>>
>>> Guys, I don't see enough information in these logs to determine what's
>>> going on. Let me tell you what I'm seeing so that perhaps other folks can
>>> help me out here.
>>>
>>> One part of the log that is suspicious is this one:
>>>
>>> {noformat}
>>> =6693 [QuorumPeer:/0.0.0.0:2181] WARN
>>>  org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x300000001 expected
>>> 0x1
>>> =6693 [QuorumPeer:/0.0.0.0:2181] WARN
>>>  org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x300000001 expected
>>> 0x1
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor30]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor27]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor22]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor23]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor18]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor20]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor19]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor31]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor21]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor26]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor25]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor33]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor29]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor28]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor24]
>>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor32]
>>>
>>> ************* NODE RESTARTED HERE **********************
>>> {noformat}
>>>
>>> Before being restarted, the bad node receives a proposal with zxid <3,1>
>>> and it expects <0,1>. Next in the logs after being restarted, I can see that
>>> it is complaining that it has epoch 4 and the leader 3. Something strange
>>> apparently happened during the restart. It also seems to be the case that
>>> the node was able to talk to the others (first entries in the log
>>> before the excerpt above).
>>>
>>> Do you guys see anything I'm overlooking?
>>>
>>> > zookeeper servers should commit the new leader txn to their logs.
>>> > -----------------------------------------------------------------
>>> >
>>> >                 Key: ZOOKEEPER-335
>>> >                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335
>>> >             Project: Zookeeper
>>> >          Issue Type: Bug
>>> >          Components: server
>>> >    Affects Versions: 3.1.0
>>> >            Reporter: Mahadev konar
>>> >            Assignee: Mahadev konar
>>> >            Priority: Blocker
>>> >             Fix For: 3.4.0
>>> >
>>> >         Attachments: zk.log.gz, zklogs.tar.gz
>>> >
>>> >
>>> > Currently the zookeeper followers do not commit the new leader
>>> > election. This will cause problems in failure scenarios, with a follower
>>> > acking the same leader txn id twice, possibly to two different
>>> > intermittent leaders, allowing them to propose two different txns with the
>>> > same zxid.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> You can reply to this email to add a comment to the issue online.
>>>
>>>
>>
>
