incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: CL1 and CLQ with 5 nodes cluster and 3 alives node
Date Sun, 21 Jul 2013 20:38:57 GMT
> I'm experiencing some problems after 3 years of Cassandra in production (from 
> 0.6 to 1.0.6): twice in 3 weeks, 2 nodes crashed with an OutOfMemory 
> exception.
Take a look at how many rows you have and the size of the bloom filters. You may have grown :)

If you have more than 500 million rows, you may want to check bloom_filter_fp_chance: the
old default was 0.000744 and the new (post 1.) value is 0.01 for size-tiered compaction.
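To get a feel for why the fp_chance matters at this scale, here is a rough estimate using the textbook optimal Bloom filter size, m/n = -ln(p) / (ln 2)^2 bits per key. This is a sketch under that assumption only: Cassandra's own filter sizing differs in the details, each SSTable carries its own filter, and the row count that matters is per node, so treat the numbers as ballpark figures.

```python
import math


def bits_per_key(fp_chance: float) -> float:
    """Optimal Bloom filter size in bits per key for a target false-positive
    rate p: -ln(p) / (ln 2)^2. Textbook formula, not Cassandra's exact sizing."""
    return -math.log(fp_chance) / (math.log(2) ** 2)


def filter_bytes(num_keys: int, fp_chance: float) -> float:
    """Approximate memory, in bytes, for a filter covering num_keys row keys."""
    return num_keys * bits_per_key(fp_chance) / 8


if __name__ == "__main__":
    rows = 500_000_000
    # Compare the old default with the newer size-tiered default.
    for p in (0.000744, 0.01):
        print(f"fp_chance={p}: {bits_per_key(p):.1f} bits/key, "
              f"~{filter_bytes(rows, p) / 1e6:.0f} MB for {rows:,} rows")
```

With this formula, 500 million keys need about 15 bits per key (roughly 940 MB of heap) at 0.000744, versus about 9.6 bits per key (roughly 600 MB) at 0.01.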


> Now a question: why, with 2 nodes offline, does my whole application stop providing 
> service, even when a Consistency Level ONE read is invoked?
> I'd expected this behaviour:
What error did the client get, and which client are you using?
It also depends on whether and how the node fails. Later versions try to shut down when there is
an OOM; I'm not sure what 1.0 does.

If the node went into a zombie state, the clients may have been timing out. They should then
move on to another node.
If it had started shutting down, the client should have received some immediate errors.
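The "time out, then move on to another node" behaviour can be sketched as a simple failover loop. This is a hypothetical helper, not the API of any particular Cassandra client: `request` stands in for whatever call your client makes against one host, and a real Thrift client would raise its own exception types (such as TTransportException) that you would need to map onto the ones caught here.

```python
import random
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


class AllHostsFailed(Exception):
    """Raised when every host in the list failed or timed out."""


def execute_with_failover(hosts: List[str],
                          request: Callable[[str], T],
                          shuffle: bool = False) -> T:
    """Try request(host) against each host in turn (hypothetical helper).

    A host that times out or refuses the connection is skipped and the
    next one is tried, instead of failing the whole operation.
    """
    order = list(hosts)
    if shuffle:
        random.shuffle(order)  # spread load instead of always hitting hosts[0]
    errors: List[Tuple[str, Exception]] = []
    for host in order:
        try:
            return request(host)
        except (TimeoutError, ConnectionError) as exc:
            errors.append((host, exc))  # remember why this host was skipped
    raise AllHostsFailed(errors)


def fake_request(host: str) -> str:
    """Stand-in for a real read: nodes 2 and 5 behave like dead/zombie nodes."""
    if host in ("node2", "node5"):
        raise TimeoutError(host)
    return f"row from {host}"


if __name__ == "__main__":
    print(execute_with_failover(["node2", "node3"], fake_request))
```

Note that against a zombie node each attempt only fails after the full client timeout, so even with failover every request routed there first is delayed by that timeout.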

Cheers


-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 19/07/2013, at 5:02 PM, cbertu81@libero.it wrote:

> Hi all,
> I'm experiencing some problems after 3 years of Cassandra in production (from 
> 0.6 to 1.0.6): twice in 3 weeks, 2 nodes crashed with an OutOfMemory 
> exception.
> In the log I can see the warning about low available heap... now I'm 
> increasing my RAM a little, raising my Java heap (1/4 of the RAM), and reducing the 
> row size and memtable thresholds. Any other tips?
> 
> Now a question: why, with 2 nodes offline, does my whole application stop providing 
> service, even when a Consistency Level ONE read is invoked?
> I'd expected this behaviour:
> 
> - CL ONE operations keep working
> - more than 80% of CL QUORUM operations keep working (the offline nodes were 2 and 5; 
> with a clockwise key distribution, only writes to the fifth node should also hit node 2)
> - most or all CL ALL operations (which I don't use) fail
> 
> Instead, ALL services stopped responding, throwing a 
> TTransportException...
> 
> Thanks in advance
> 
> Carlo
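Carlo's expected behaviour can be checked with a small ring simulation. This is a minimal sketch that assumes replication factor 3 and SimpleStrategy-style placement (each range is stored on its owner and the next two nodes clockwise) with equally sized ranges; the thread does not actually state the RF or the placement strategy, and the helper names are hypothetical.

```python
from typing import Dict, List, Set


def replicas(owner: int, num_nodes: int, rf: int) -> List[int]:
    """Nodes are 1..num_nodes on a ring; the range owned by `owner` is stored
    on the owner and the next rf-1 nodes clockwise (SimpleStrategy-like)."""
    return [((owner - 1 + i) % num_nodes) + 1 for i in range(rf)]


def required(cl: str, rf: int) -> int:
    """Replica acknowledgements needed for a consistency level."""
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[cl]


def availability(num_nodes: int, rf: int, down: Set[int]) -> Dict[str, float]:
    """Fraction of (equally sized) token ranges still servable per CL."""
    result = {}
    for cl in ("ONE", "QUORUM", "ALL"):
        ok = 0
        for owner in range(1, num_nodes + 1):
            alive = sum(1 for n in replicas(owner, num_nodes, rf) if n not in down)
            if alive >= required(cl, rf):
                ok += 1
        result[cl] = ok / num_nodes
    return result


if __name__ == "__main__":
    # 5-node ring with nodes 2 and 5 offline, as in the report above.
    print(availability(num_nodes=5, rf=3, down={2, 5}))
```

Under these assumptions every range is still readable at ONE, 4 of 5 ranges (80%) still reach QUORUM (only the range owned by node 5, replicated on 5, 1 and 2, loses two replicas), and no range can satisfy ALL. If the RF really was 3, a total outage therefore points more towards the client-side timeouts discussed in the reply than towards missing replicas.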

