hadoop-zookeeper-user mailing list archives

From Travis Crawford <traviscrawf...@gmail.com>
Subject Re: Recovery issue - how to debug?
Date Mon, 19 Apr 2010 20:18:10 GMT
On Mon, Apr 19, 2010 at 12:10 PM, Patrick Hunt <phunt@apache.org> wrote:
>
> On 04/19/2010 11:55 AM, Travis Crawford wrote:
>>
>> To double-check, is the best way to tell whether a ZK instance is
>> up-to-date to look at its ``LastZxid`` value? For example:
>>
>> $ java -jar /home/travis/cmdline-jmxclient-0.10.5.jar - localhost:8081 \
>>     org.apache.ZooKeeperService:name0=ReplicatedServer_id1,name1=replica.1,name2=Follower,name3=InMemoryDataTree \
>>     LastZxid
>> 04/19/2010 18:42:45 +0000 org.archive.jmx.Client LastZxid: 0xf000420ad
>>
>> I believe the ``LastZxid`` for each ZK instance needs to be compared
>> to the leader to see how far behind it is.
>
> Well the server will only be "active" once it joins the quorum (usually as a
> follower) so if it's having trouble joining that data might not be
> available. But yes, once the server is active then you could examine the
> lastzxid to determine if/how much it's lagging the leader (quorum).
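
To make the comparison concrete, here is a minimal sketch of how two
zxid values could be compared. It relies on the fact that a zxid is a
64-bit number whose high 32 bits are the leader epoch and whose low 32
bits are a counter within that epoch. The function names `parse_zxid`
and `zxid_lag` are made up for illustration, and the leader value in the
usage note is hypothetical; only 0xf000420ad comes from the output above.

```python
def parse_zxid(text):
    """Split a hex zxid string such as '0xf000420ad' into (epoch, counter)."""
    value = int(text, 16)  # int() with base 16 accepts the '0x' prefix
    epoch = value >> 32            # high 32 bits: leader epoch
    counter = value & 0xFFFFFFFF   # low 32 bits: counter within the epoch
    return epoch, counter


def zxid_lag(leader_zxid, follower_zxid):
    """Return how many transactions the follower is behind the leader.

    Returns None when the epochs differ, because the counters restart
    with each epoch and a plain subtraction would be misleading.
    """
    leader_epoch, leader_counter = parse_zxid(leader_zxid)
    follower_epoch, follower_counter = parse_zxid(follower_zxid)
    if leader_epoch != follower_epoch:
        return None
    return leader_counter - follower_counter
```

For example, `zxid_lag("0xf000420b0", "0xf000420ad")` gives 3 if the
leader happened to report 0xf000420b0 (a hypothetical value). Treating a
differing epoch as "unknown" rather than as a number is a design choice
for this sketch, not something the server prescribes.
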
>
>>
>>
>> It would be a lot easier from the operations perspective if the leader
>> explicitly published some health stats:
>>
>> (a) Count of instances in the ensemble.
>> (b) Count of up-to-date instances in the ensemble.
>>
>> This would greatly simplify monitoring & alerting - when an instance
>> falls behind one could configure their monitoring system to let
>> someone know and take a look at the logs.
>
> That's a great idea. Please enter a JIRA for this - a new 4 letter word and
> JMX support. It would also be a great starter project for someone interested
> in becoming more familiar with the server code.

Filed:

    https://issues.apache.org/jira/browse/ZOOKEEPER-744

Attached is a screenshot of some JMX output in Ganglia - it's currently
implemented using a -javaagent tool I happened to find. Having a
simple non-Java way to fetch monitoring stats and publish to an
external monitoring system would be awesome, and probably reusable by
others.
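
As a rough illustration of what a non-Java fetch could look like today,
the sketch below sends the existing "stat" four letter word over a plain
socket (the same thing as echo "stat" | nc localhost port) and pulls out
two fields. It assumes the reply contains "Zxid:" and "Mode:" lines; the
function names `four_letter_word` and `parse_stat` are made up for this
example, and the host and port are whatever your server's client port is.

```python
import socket


def four_letter_word(host, port, word="stat", timeout=5.0):
    """Send a four letter word to a ZooKeeper server and return the reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(text):
    """Extract the 'Zxid' and 'Mode' fields from stat output, if present."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("Zxid", "Mode"):
            fields[key.strip()] = value.strip()
    return fields
```

Something like `parse_stat(four_letter_word("localhost", port))` run
against each server would give values a monitoring system could publish.
If a server has not joined the quorum, the fields may simply be missing,
so callers should expect an empty or partial dict.
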

--travis


>
> Patrick
>
>
>>
>> --travis
>>
>>
>>
>>
>> On Mon, Apr 19, 2010 at 10:14 AM, Patrick Hunt <phunt@apache.org> wrote:
>>>
>>> Usually the server logs will shed light on such issues. If we had access
>>> to them it might be easier to speculate.
>>>
>>> Patrick
>>>
>>> On 04/19/2010 09:22 AM, Mahadev Konar wrote:
>>>>
>>>> Hi Hao,
>>>>   As Vishal already asked, how are you determining if the writes are
>>>> being received?
>>>>  Also, what was the status of C2 when you checked for these writes? Do
>>>> you have the output of echo "stat" | nc localhost port?
>>>>
>>>> How long did you wait when you say that C2 did not receive the writes?
>>>> What was the status of C2 (again echo "stat" | nc localhost port) when
>>>> you saw that C2 had received the writes?
>>>>
>>>> Thanks
>>>> mahadev
>>>>
>>>>
>>>> On 4/18/10 10:54 PM, "Dr Hao He" <he@softtouchit.com> wrote:
>>>>
>>>>> I have a ZooKeeper cluster E1 with 3 nodes: A, B, and C.
>>>>>
>>>>> I stopped C and did some writes on E1.  Both A and B received the
>>>>> writes.  I then started C and after a short while, C also received
>>>>> the writes.
>>>>>
>>>>> All seemed to go well, so I replicated the setup to another cluster
>>>>> E2 with exactly 3 nodes: A2, B2, and C2.
>>>>>
>>>>> I stopped C2 and did some writes on E2.  A2 received the writes.  I
>>>>> then started C2.  However, no matter how long I waited, C2 never
>>>>> received the writes.
>>>>>
>>>>> I then did more writes on E2.  Then C2 received all the writes,
>>>>> including the old writes from when it was down.
>>>>>
>>>>> How do I find out what was wrong with the E2 setup?
>>>>>
>>>>> I am running 3.2.2 on all nodes.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Dr Hao He
>>>>>
>>>>> XPE - the truly SOA platform
>>>>>
>>>>> he@softtouchit.com
>>>>> http://softtouchit.com
>>>>>
>>>>>
>>>>
>>>
>
