cassandra-user mailing list archives

From <mail.list.steel.men...@gmail.com>
Subject bug when node down-up??
Date Sun, 27 Dec 2009 15:38:57 GMT
Hi guys,

 

I think I have found a bug: a live cluster does not seem to tolerate the reboot of a single
node, although it is supposed to.

 

Suppose a cluster of 8 nodes that holds about 10000 rows (keys range from 1 to
10000):

Address       Status     Load          Range                                      Ring
                                       170141183460469231731687303715884105728
10.237.4.85   Up         757.13 MB     21267647932558653966460912964485513216     |<--|
10.237.1.135  Up         761.54 MB     42535295865117307932921825928971026432     |   ^
10.237.1.137  Up         748.02 MB     63802943797675961899382738893456539648     v   |
10.237.1.139  Up         732.36 MB     85070591730234615865843651857942052864     |   ^
10.237.1.140  Up         725.6 MB      106338239662793269832304564822427566080    v   |
10.237.1.141  Up         726.59 MB     127605887595351923798765477786913079296    |   ^
10.237.1.143  Up         728.16 MB     148873535527910577765226390751398592512    v   |
10.237.1.144  Up         745.69 MB     170141183460469231731687303715884105728    |-->|
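
For reference, the tokens in the ring listing are evenly spaced: each one equals i * 2^127 / 8 for i = 1..8. The short sketch below reproduces them; the function name is only illustrative and is not part of any Cassandra tool.

```python
# Evenly spaced tokens on a ring of size 2**127, matching the listing above.
RING_SIZE = 2 ** 127
NODE_COUNT = 8


def evenly_spaced_tokens(node_count, ring_size=RING_SIZE):
    """Return the tokens i * ring_size / node_count for i = 1..node_count."""
    return [i * ring_size // node_count for i in range(1, node_count + 1)]


tokens = evenly_spaced_tokens(NODE_COUNT)

if __name__ == "__main__":
    for token in tokens:
        print(token)
```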

 

(1)     Read keys in the range [1-10000]; all keys are read successfully (the client sends read
requests directly to 10.237.4.85, 10.237.1.137, 10.237.1.140 and 10.237.1.143).

(2)     Turn off 10.237.1.135 while keeping up the load. Some read requests time out; once all
nodes know that 10.237.1.135 is down (about 10 s later), all read requests succeed again.
That's fine.

(3)     After turning 10.237.1.135 back on (and the Cassandra service, of course), some read
requests time out again, and they keep timing out FOREVER, even after all nodes know that
10.237.1.135 is up. That's a PROBLEM!

(4)     Reboot 10.237.1.135: the problem remains.

(5)     If I stop the load, reboot the whole cluster and then repeat step 1, everything is fine
again.
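
The read pass in step (1) could be sketched roughly as follows. This is not the client actually used in the test: `read_key` is a hypothetical callable standing in for the real client call, and it is assumed to raise `TimeoutError` when a read times out. Only the coordinator addresses and the key range come from the description above.

```python
from collections import Counter

# Coordinators the client talks to in step (1), taken from the message.
COORDINATORS = ["10.237.4.85", "10.237.1.137", "10.237.1.140", "10.237.1.143"]


def run_read_pass(read_key, keys=range(1, 10001), coordinators=COORDINATORS):
    """Read every key once, rotating over the coordinators.

    read_key(host, key) is a hypothetical stand-in for the real client call
    and is assumed to raise TimeoutError on a timed-out read.
    Returns a Counter mapping coordinator -> number of timed-out reads.
    """
    timeouts = Counter()
    for i, key in enumerate(keys):
        host = coordinators[i % len(coordinators)]
        try:
            read_key(host, str(key))
        except TimeoutError:
            timeouts[host] += 1
    return timeouts


def fake_read_key(host, key):
    """Fake reader for trying the loop: every 100th key times out."""
    if int(key) % 100 == 0:
        raise TimeoutError(key)
    return b"value"


if __name__ == "__main__":
    print(run_read_pass(fake_read_key))
```

Running one pass before step (2), one during it and one after step (3) would give per-coordinator timeout counts that can be compared directly.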

 

All read requests use the Quorum consistency level. The Cassandra version is
apache-cassandra-incubating-0.5.0-beta2; I have also tested apache-cassandra-incubating-0.5.0-RC1,
and the problem remains.
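
As a reminder of why a single node going down should be tolerable with Quorum reads: a quorum is a strict majority of the replicas. The sketch below assumes a replication factor of 3, which the message does not state; the function names are illustrative only.

```python
def quorum_size(replication_factor):
    """Number of replicas that must answer a quorum read (strict majority)."""
    return replication_factor // 2 + 1


def quorum_reachable(replication_factor, replicas_down):
    """True if enough replicas remain up to satisfy a quorum read."""
    return replication_factor - replicas_down >= quorum_size(replication_factor)


if __name__ == "__main__":
    rf = 3  # assumption: the replication factor is not given in the message
    print(quorum_size(rf), quorum_reachable(rf, 1), quorum_reachable(rf, 2))
```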

 

After reading system.log, I found that once 10.237.1.135 has gone down and come back up, the
other nodes never re-establish a TCP connection to it (on TCP port 7000)!

Read requests destined for 10.237.1.135 (queued in Pending-Writes because the socket channel is
closed) are never sent over the network (as observed with tcpdump).
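
To rule out a plain network problem, one can check from each of the other nodes whether port 7000 on the restarted node accepts TCP connections at all. A minimal sketch using only the Python standard library; the helper name `can_connect` is made up for this example.

```python
import socket


def can_connect(host, port=7000, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: treat all of these as "no".
        return False


if __name__ == "__main__":
    print(can_connect("10.237.1.135"))
```

If this returns True while the other nodes still never open a connection, the problem is more likely in how they handle the closed channel than in the network itself.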

 

It seems that when 10.237.1.135 went down in step 2, some socket channels were reset, and after
10.237.1.135 came back these socket channels remained closed forever. I am not sure about this.
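
The behavior one would expect instead is that a sender drops a dead socket and opens a new one on the next attempt, so queued messages go out once the peer is back. The sketch below only illustrates that idea; it is not Cassandra's messaging code, and the class and attribute names are invented for the example.

```python
import socket


class ReconnectingSender:
    """Queue messages and reconnect lazily when the socket is unusable."""

    def __init__(self, host, port=7000, timeout=2.0):
        self.host = host
        self.port = port
        self.timeout = timeout
        self.pending = []  # messages waiting for a usable connection
        self._sock = None

    def send(self, payload):
        """Queue payload and try to flush; True if nothing is left pending."""
        self.pending.append(payload)
        return self.flush()

    def flush(self):
        """Try to send all pending messages, reconnecting if needed."""
        try:
            if self._sock is None:
                self._sock = socket.create_connection(
                    (self.host, self.port), timeout=self.timeout
                )
            while self.pending:
                self._sock.sendall(self.pending[0])
                self.pending.pop(0)
            return True
        except OSError:
            # Drop the dead socket so that the next flush opens a new one.
            if self._sock is not None:
                self._sock.close()
            self._sock = None
            return False
```

The point of the design is that a failed send never leaves a permanently closed channel behind: the messages stay queued and the next `flush()` starts from a fresh connection.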

 

Sorry for my poor English; I hope I have stated my problem clearly.

 

---------END----------

 

