incubator-cassandra-user mailing list archives

From Ramzi Rabah <rra...@playdom.com>
Subject Re: bug when node down-up??
Date Sun, 27 Dec 2009 16:43:25 GMT
I believe this is the same problem as
https://issues.apache.org/jira/browse/CASSANDRA-651



On Sun, Dec 27, 2009 at 7:38 AM,  <mail.list.steel.mental@gmail.com> wrote:
> Hi, guys:
>
>
>
> I think I have found a bug: a live cluster does not seem to survive the
> reboot of a single node, although it is supposed to.
>
>
>
> Suppose a cluster of 8 nodes holding about 10000 rows (keys ranging
> from 1 to 10000):
>
> Address       Status     Load          Range                                      Ring
>                                        170141183460469231731687303715884105728
> 10.237.4.85   Up         757.13 MB     21267647932558653966460912964485513216     |<--|
> 10.237.1.135  Up         761.54 MB     42535295865117307932921825928971026432     |   ^
> 10.237.1.137  Up         748.02 MB     63802943797675961899382738893456539648     v   |
> 10.237.1.139  Up         732.36 MB     85070591730234615865843651857942052864     |   ^
> 10.237.1.140  Up         725.6 MB      106338239662793269832304564822427566080    v   |
> 10.237.1.141  Up         726.59 MB     127605887595351923798765477786913079296    |   ^
> 10.237.1.143  Up         728.16 MB     148873535527910577765226390751398592512    v   |
> 10.237.1.144  Up         745.69 MB     170141183460469231731687303715884105728    |-->|
>
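
The tokens in the quoted ring listing look evenly spaced: each one is a multiple of 2**124, and the last equals 2**127. As a minimal sketch, assuming the token space really is 2**127 (an inference from the listing, not something stated in the report), the same values can be reproduced as follows. The function name `evenly_spaced_tokens` is made up for illustration.

```python
# Assumption: the token space is 2**127, inferred from the last token in the listing.
TOKEN_SPACE = 2 ** 127


def evenly_spaced_tokens(node_count, token_space=TOKEN_SPACE):
    """Return node_count tokens, the i-th being i * token_space / node_count."""
    return [i * token_space // node_count for i in range(1, node_count + 1)]


tokens = evenly_spaced_tokens(8)
for token in tokens:
    print(token)
```

If the assumption holds, the eight printed values should match the Range column of the listing, which would mean each node owns an equal share of the ring.
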
>
>
> (1)     Read keys in the range [1-10000]; all keys are read OK (the client sends
> read requests directly to 10.237.4.85, 10.237.1.137, 10.237.1.140 and 10.237.1.143).
>
> (2)     Turn off 10.237.1.135 while keeping up the load; some read requests
> time out.
>
> After all nodes learn that 10.237.1.135 is down (about 10 s later), all read
> requests succeed again. That’s fine.
>
> (3)     After turning 10.237.1.135 back on (and the Cassandra service, of course),
> some read requests time out again, and keep timing out FOREVER, even after all
> nodes know that 10.237.1.135 is up.
>
> That’s a PROBLEM!
>
> (4)     Reboot 10.237.1.135; the problem remains.
>
> (5)     If I stop the load, reboot the whole cluster and then repeat step 1,
> everything is fine again.
>
>
>
> All read requests use the Quorum consistency level. The Cassandra version is
> apache-cassandra-incubating-0.5.0-beta2; I have also tested
> apache-cassandra-incubating-0.5.0-RC1, and the problem remains.
>
>
>
> After reading system.log, I found that once 10.237.1.135 has gone down and come
> back up, the other nodes never re-establish a TCP connection to it (on TCP port 7000).
>
> Read requests destined for 10.237.1.135 (queued in Pending-Writes because the
> socket channel is closed) are never sent over the network (as observed with tcpdump).
>
>
>
> It seems that when 10.237.1.135 went down in step 2, some socket channels were
> reset,
>
> and after 10.237.1.135 came back these socket channels stayed closed forever,
> but I am not sure.
>
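
To make the suspected failure mode concrete: messages pile up in a pending queue because the channel is closed, and nothing ever reopens it. Below is a minimal sketch of the behaviour one would expect instead, where a closed channel is reopened before the queue is drained. It is not Cassandra's code. `PeerConnection`, `connect` and `flush` are hypothetical names used only for illustration.

```python
import collections


class PeerConnection:
    """Hypothetical outbound link that reopens a closed channel before writing."""

    def __init__(self, connect):
        # connect: callable returning an object with sendall();
        # it is expected to raise OSError while the peer is unreachable.
        self._connect = connect
        self._channel = None
        self.pending = collections.deque()

    def send(self, message):
        """Queue a message and try to deliver everything that is queued."""
        self.pending.append(message)
        self.flush()

    def flush(self):
        """Drain the queue, reconnecting first if the channel is closed."""
        while self.pending:
            if self._channel is None:
                try:
                    self._channel = self._connect()
                except OSError:
                    return  # peer still down: keep messages queued for a later flush
            try:
                self._channel.sendall(self.pending[0])
            except OSError:
                self._channel = None  # channel was reset: reconnect on the next flush
                return
            self.pending.popleft()
```

With a real socket, `connect` could be something like `lambda: socket.create_connection(("10.237.1.135", 7000), timeout=5)`, and `flush()` could also be called when the node is reported up again. The point of the sketch is only that a closed channel must not be a terminal state for the queue.
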
>
>
> Sorry for my poor English; I hope I have stated my problem clearly.
>
>
>
> ---------END----------
>
>
