cassandra-user mailing list archives

From Paco Trujillo <F.Truji...@genetwister.nl>
Subject RE: all the hosts are not reachable when running massive deletes
Date Tue, 05 Apr 2016 08:24:41 GMT
Hi Alain


- Overusing the cluster was one thing I was thinking about, and I have requested two new nodes (that was already planned anyway). But the pattern of high CPU load is only visible on one or two of the nodes; the rest are working correctly. That makes me think that adding two new nodes may not help.


- Running the deletes at a slower and constant pace sounds good, and I will definitely try that (see the sketch after these points). However, I see similar errors during the weekly repair, even without the deletes running.



- Our cluster is an in-house one; each machine is only used as a Cassandra node.



- The logs look quite normal, even when the timeouts start to appear on the client.



- Upgrading Cassandra is a good point, but I am afraid that if I start the upgrade right now the timeout problems will appear again. Are compactions executed during an upgrade? If they are not, I think it is safe to upgrade the cluster.
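
As a rough sketch, running the deletes at a slower and constant pace could look something like this with the DataStax Java driver 2.x. The contact point, keyspace, table, id source and the 200 statements/second rate are placeholders for this example, and RateLimiter comes from Guava, which the driver already pulls in:

import java.util.List;
import java.util.UUID;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.RateLimiter;

public class ThrottledDeletes {

    public static void main(String[] args) {
        // Contact point, keyspace and table are placeholders for this sketch.
        Cluster cluster = Cluster.builder().addContactPoint("172.31.7.232").build();
        Session session = cluster.connect("my_keyspace");
        PreparedStatement delete = session.prepare("DELETE FROM my_table WHERE id = ?");

        // Cap the throughput (here 200 deletes/second) so the mutation stage
        // on the nodes never builds up a large backlog of pending writes.
        RateLimiter limiter = RateLimiter.create(200.0);

        for (UUID id : idsToDelete()) {
            limiter.acquire();                      // blocks until the next permit is available
            BoundStatement bound = delete.bind(id);
            session.execute(bound);                 // synchronous, so errors surface immediately
        }

        cluster.close();
    }

    // Placeholder: supply the real list of ids to delete here.
    private static List<UUID> idsToDelete() {
        throw new UnsupportedOperationException("load the ids to delete");
    }
}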

Thanks for your comments

From: Alain RODRIGUEZ [mailto:arodrime@gmail.com]
Sent: Monday, 4 April 2016 18:35
To: user@cassandra.apache.org
Subject: Re: all the hosts are not reachable when running massive deletes

Hola Paco,

the pending count for the mutation stage grows without stopping, could that be the problem

CPU (near 96%)

Yes, basically I think you are overusing this cluster.

but two of them have a high CPU load, especially .232, because I am running a lot of deletes using cqlsh on that node.

Solutions would be to run the deletes at a slower and constant pace, against all the nodes, using a balancing policy, or to add capacity if all the nodes are facing the issue and you cannot slow the deletes down. You should also have a look at iowait and steal, to see whether the CPUs are really 100% busy or are masking another issue (a disk not answering fast enough, or a hardware / shared-instance problem). I had some noisy neighbours at some point while using Cassandra on AWS.
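
For the balancing-policy part, a minimal sketch with the Java driver would be to list several (or all) nodes as contact points and wrap a round-robin policy in a token-aware one, so the deletes are spread over the whole ring instead of going through a single coordinator (the keyspace name below is a placeholder):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class BalancedCluster {

    public static Session connect() {
        // Give the driver several contact points instead of only .232, and let a
        // token-aware round-robin policy pick coordinators across the whole ring.
        Cluster cluster = Cluster.builder()
                .addContactPoints("172.31.7.232", "172.31.7.233", "172.31.7.243",
                                  "172.31.7.244", "172.31.7.245", "172.31.7.246",
                                  "172.31.7.247")
                .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
                .build();
        return cluster.connect("my_keyspace");   // placeholder keyspace
    }
}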

I cannot find the reason for the timeouts.

I don't find it that strange while some or all of the nodes are being overused.

I have already increased the timeouts, but I do not think that is a solution because the timeouts indicate another type of error

Any relevant logs on the Cassandra nodes (other than the dropped mutations INFO messages)?

7 nodes version 2.0.17

Note: Be aware that this Cassandra version is quite old and no longer supported. Plus, you might be facing issues that have already been solved. I know that upgrading is not straightforward, but 2.0 --> 2.1 brings an amazing set of optimisations and some fixes too. You should try it out :-).

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


2016-04-04 14:33 GMT+02:00 Paco Trujillo <F.Trujillo@genetwister.nl>:
Hi everyone

We are having problems with our cluster (7 nodes, version 2.0.17) when running “massive deletes”
on one of the nodes (via the cql command line). At the beginning everything is fine, but after
a while we start getting constant NoHostAvailableException errors from the DataStax driver:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried
for query failed (tried: /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException:
Timeout while trying to acquire available connection (you may want to increase the driver
number of per-host connections)), /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException:
Timeout while trying to acquire available connection (you may want to increase the driver
number of per-host connections)), /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException:
Timeout while trying to acquire available connection (you may want to increase the driver
number of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042,
/172.31.7.244:9042
[only showing errors of first 3 hosts, use getErrors() for more details])
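
(The "increase the driver number of per-host connections" part of the message corresponds to the Java driver's PoolingOptions. A minimal sketch of how those could be raised, with purely illustrative values, is below; note that this only changes the client-side pool, it does not reduce the load on the nodes themselves:)

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

public class PoolTuning {

    public static Cluster build() {
        // Allow more simultaneous connections (and therefore more in-flight
        // requests) per node before the driver has to queue callers and
        // eventually time out waiting for a free connection.
        PoolingOptions pooling = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL, 2)
                .setMaxConnectionsPerHost(HostDistance.LOCAL, 8);

        return Cluster.builder()
                .addContactPoint("172.31.7.243")   // any reachable node works as a contact point
                .withPoolingOptions(pooling)
                .build();
    }
}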


All the nodes are running:

UN  172.31.7.244  152.21 GB  256     14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
UN  172.31.7.245  168.4 GB   256     14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
UN  172.31.7.246  177.71 GB  256     13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
UN  172.31.7.247  158.57 GB  256     14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
UN  172.31.7.243  176.83 GB  256     14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
UN  172.31.7.233  159 GB     256     13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
UN  172.31.7.232  166.05 GB  256     15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1

but two of them have a high CPU load, especially .232, because I am running a lot of deletes
using cqlsh on that node.

I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think it is
normal that none of the hosts are accessible.

We have a replication factor of 3, and for the deletes I am not setting any consistency level
(so they use the default, ONE).
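
(For reference, this is roughly how an explicit consistency level could be set with the Java driver, either as a driver-wide default or per statement; the contact point and the choice of QUORUM are just placeholders, and in cqlsh the equivalent is the CONSISTENCY command:)

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;

public class ConsistencySketch {

    // Option 1: make QUORUM the default for every statement run through this Cluster.
    public static Cluster clusterWithQuorumDefault() {
        return Cluster.builder()
                .addContactPoint("172.31.7.243")
                .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.QUORUM))
                .build();
    }

    // Option 2: override the consistency level on a single (e.g. delete) statement.
    public static BoundStatement withQuorum(BoundStatement delete) {
        delete.setConsistencyLevel(ConsistencyLevel.QUORUM);
        return delete;
    }
}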

I checked the nodes with a lot of CPU (near 96%), and the GC activity remains at 1.6% (using
only 3 GB of the 10 they have assigned). But looking at the thread pool stats, the pending
count for the mutation stage grows without stopping; could that be the problem?

I cannot find the reason for the timeouts. I have already increased the timeouts, but I do not
think that is a solution because the timeouts indicate another type of error. Does anyone have
a tip to help determine where the problem is?

Thanks in advance
