hadoop-zookeeper-user mailing list archives

From Travis Crawford <traviscrawf...@gmail.com>
Subject Re: Recovery issue - how to debug?
Date Mon, 19 Apr 2010 18:55:30 GMT
To double-check: is looking at a ZK instance's ``LastZxid`` value the
best way to tell whether it is up to date? For example:

$ java -jar /home/travis/cmdline-jmxclient-0.10.5.jar - localhost:8081 \
    org.apache.ZooKeeperService:name0=ReplicatedServer_id1,name1=replica.1,name2=Follower,name3=InMemoryDataTree \
    LastZxid
04/19/2010 18:42:45 +0000 org.archive.jmx.Client LastZxid: 0xf000420ad

I believe each ZK instance's ``LastZxid`` needs to be compared with
the leader's to see how far behind that instance is.
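
To make that comparison concrete, here is a minimal Python sketch. It
assumes the usual zxid layout (upper 32 bits are the leader epoch, lower
32 bits are a per-epoch counter). The function names and the leader
value are made up for illustration; only the follower value comes from
the JMX output above.

```python
def parse_zxid(text):
    """Split a hex zxid string such as '0xf000420ad' into (epoch, counter)."""
    zxid = int(text, 16)
    return zxid >> 32, zxid & 0xFFFFFFFF


def zxid_lag(leader, follower):
    """Return how many transactions the follower trails the leader by.

    Returns None when the epochs differ, because the counters are then
    not directly comparable.
    """
    leader_epoch, leader_count = parse_zxid(leader)
    follower_epoch, follower_count = parse_zxid(follower)
    if leader_epoch != follower_epoch:
        return None
    return leader_count - follower_count


if __name__ == "__main__":
    follower = "0xf000420ad"   # value from the JMX query above
    leader = "0xf000420b0"     # hypothetical leader value
    print(parse_zxid(follower))        # (15, 270509)
    print(zxid_lag(leader, follower))  # 3
```

A result of 0 would mean the follower has seen everything the leader
has; None means the follower is still in an older epoch.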


It would be a lot easier from the operations perspective if the leader
explicitly published some health stats:

(a) Count of instances in the ensemble.
(b) Count of up-to-date instances in the ensemble.

This would greatly simplify monitoring and alerting: a monitoring
system could be configured to notify someone when an instance falls
behind, so they can take a look at the logs.
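
Until the leader publishes something like this, the two counts could be
derived on the monitoring side. Below is a rough Python sketch, assuming
each instance's ``LastZxid`` has already been collected (for example via
JMX) as a hex string. The function name, the ``max_lag`` parameter, and
the sample values are hypothetical.

```python
def ensemble_health(last_zxids, leader, max_lag=0):
    """Summarize an ensemble from a {instance name: LastZxid hex string} map.

    An instance counts as up to date when its zxid trails the leader's
    by at most max_lag. Raw zxids are compared as 64-bit integers, so an
    instance stuck in an older epoch always shows up as behind.
    """
    leader_zxid = int(last_zxids[leader], 16)
    behind = sorted(
        name
        for name, zxid in last_zxids.items()
        if leader_zxid - int(zxid, 16) > max_lag
    )
    return {
        "instances": len(last_zxids),                 # stat (a)
        "up_to_date": len(last_zxids) - len(behind),  # stat (b)
        "behind": behind,
    }


if __name__ == "__main__":
    # Illustrative values only, not taken from a real ensemble.
    sample = {"A2": "0xf000420b0", "B2": "0xf000420b0", "C2": "0xf000420ad"}
    print(ensemble_health(sample, leader="A2"))
```

An alert could then fire whenever "up_to_date" is lower than
"instances" for longer than some grace period.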

--travis




On Mon, Apr 19, 2010 at 10:14 AM, Patrick Hunt <phunt@apache.org> wrote:
> Usually the server logs will shed light on such issues. If we had access to
> them it might be easier to speculate.
>
> Patrick
>
> On 04/19/2010 09:22 AM, Mahadev Konar wrote:
>>
>> Hi Hao,
>>   As Vishal already asked, how are you determining if the writes are being
>> received?
>>  Also, what was the status of C2 when you checked for these writes? Do you
>> have the output of echo "stat" | nc localhost port?
>>
>> How long did you wait when you say that C2 did not receive the writes?
>> What was the status of C2 (again echo "stat" | nc localhost port) when
>> you saw that C2 had received the writes?
>>
>> Thanks
>> mahadev
>>
>>
>> On 4/18/10 10:54 PM, "Dr Hao He"<he@softtouchit.com>  wrote:
>>
>>> I have zookeeper cluster E1 with 3 nodes A,B, and C.
>>>
>>> I stopped C and did some writes on E1.  Both A and B received the writes.
>>> I then started C and after a short while, C also received the writes.
>>>
>>> All seemed to go well, so I replicated the setup to another cluster E2
>>> with exactly 3 nodes: A2, B2, and C2.
>>>
>>> I stopped C2 and did some writes on E2.  A2 received the writes.  I then
>>> started C2.  However, no matter how long I waited, C2 never received the
>>> writes.
>>>
>>> I then did more writes on E2.  C2 then received all the writes,
>>> including the older ones made while it was down.
>>>
>>> How do I find out what was wrong with the E2 setup?
>>>
>>> I am running 3.2.2 on all nodes.
>>>
>>> Regards,
>>>
>>> Dr Hao He
>>>
>>> XPE - the truly SOA platform
>>>
>>> he@softtouchit.com
>>> http://softtouchit.com
>>>
>>>
>>
>
