zookeeper-user mailing list archives

From Benjamin Reed <br...@yahoo-inc.com>
Subject Re: question about ZK robustness
Date Wed, 01 Dec 2010 19:05:37 GMT
Chang, this is indeed a serious bug. It would be great if we could
reproduce it reliably. Could you confirm the version of the code you are
using? Could you include enough detail that we could try to reproduce it
on our cluster?


On 12/01/2010 07:05 AM, Vishal Kher wrote:
> Agreed with Chang on all fronts. I will repro the problem and upload logs.
> 2010/12/1 Chang Song <tru64ufs@me.com>
>> I think it is not too difficult to reproduce.
>> Just create a 3-node ensemble, and have some clients create ephemeral nodes.
>> Then kill one of the ensemble servers with kill -9.
>> I don't remember whether it was a leader or a follower.
>> Then, if you see those ephemeral nodes gone, restart the ensemble Java
>> process.
>> I think I have seen this happen twice when I repeated this same
>> experiment multiple times.
>> I am not trying to create FUD around Zookeeper. Actually it is the exact
>> opposite.
>> I fell in love with Zookeeper, and I still am.  I am using Zookeeper for
>> our production system.
>> In fact, it is THE only Java solution I believe in. Really.
>> I just couldn't find time to reproduce and report a bug.
>> Chang
>> On Dec 1, 2010, at 11:08 PM, Fournier, Camille F. [Tech] wrote:
>>> Would love to hear more about your ensemble settings to try and recreate
>>> this issue. Would be a very bad thing for my deployment as well...
>>> Camille
>>> ----- Original Message -----
>>> From: Chang Song <tru64ufs@me.com>
>>> To: user@zookeeper.apache.org <user@zookeeper.apache.org>
>>> Cc: zookeeper-user@hadoop.apache.org <zookeeper-user@hadoop.apache.org>
>>> Sent: Wed Dec 01 08:09:30 2010
>>> Subject: Re: question about ZK robustness
>>> Ted.
>>> I have seen inconsistency between different ensemble servers when we did
>>> some torture testing.
>>> I killed the Java process with -9 on one ensemble server and restarted it,
>>> and saw that ephemeral nodes that had disappeared from the other two
>>> ensemble servers were stuck in the newly restarted ensemble server. No
>>> matter what I did ("create, sync, get"), the ephemeral nodes did not
>>> disappear.  I had to remove the log and force a re-sync from scratch.
>>> I had seen this behavior twice. Exactly the same behavior. I had about
>>> 2000 clients connected to the ensemble servers. I had no time to file a
>>> bug report, but when I have time to do another round of torture testing,
>>> I will definitely file a bug report.
>>> This is not data loss, but a serious, dead serious inconsistency as far
>>> as my application goes.
>>> Please let me know if you happen to know of a related bug.
>>> Thank you.
>>> Chang
>>> On Dec 1, 2010, at 1:41 PM, Ted Dunning wrote:
>>>> Sure.  Let me know when.  I have learned a bit more from Ben since I
>>>> wrote that first bit so I could amplify the exposition
>>>> just a bit when the time comes.
>>>> On Tue, Nov 30, 2010 at 8:07 PM, Mahadev Konar <mahadev@yahoo-inc.com> wrote:
>>>>> I meant to say, we can wait a while before we are done moving to the
>>>>> new wiki tree.
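
The symptom described in this thread (ephemeral nodes gone from two ensemble servers but still present on the restarted one) can be checked mechanically once a listing of znode paths has been collected from each server. The following is only a minimal sketch of that comparison step, not something posted by anyone in the thread. It assumes the caller has already gathered the paths by connecting a client to each server individually; the function name, the server labels, and the example paths are all hypothetical.

```python
from collections import Counter


def find_divergent_nodes(views):
    """Return {server: paths it reports that a majority of servers do not}.

    `views` maps a server label to the set of znode paths seen on that
    server. How the paths are collected is left to the caller (assumption:
    one client session per server, so each listing reflects one server).
    """
    # Count on how many servers each path is visible.
    counts = Counter(path for paths in views.values() for path in set(paths))
    quorum = len(views) // 2 + 1

    divergent = {}
    for server, paths in views.items():
        # A path seen by fewer than a quorum of servers is suspicious.
        stale = {p for p in paths if counts[p] < quorum}
        if stale:
            divergent[server] = stale
    return divergent


if __name__ == "__main__":
    # Hypothetical listings after restarting the killed server "zk3".
    views = {
        "zk1": {"/locks/a"},
        "zk2": {"/locks/a"},
        "zk3": {"/locks/a", "/locks/b"},
    }
    print(find_divergent_nodes(views))
```

This only flags extra paths held by a minority of servers. Because a follower can briefly lag behind the leader, a single mismatch is not proof of a bug; the case reported above is one where the mismatch persisted no matter what the client did.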
