cassandra-user mailing list archives

From Paco Trujillo <F.Truji...@genetwister.nl>
Subject all the hosts are not reachable when running massive deletes
Date Mon, 04 Apr 2016 12:33:52 GMT
Hi everyone

We are having problems with our cluster (7 nodes, version 2.0.17) when running "massive deletes"
on one of the nodes (via the cql command line). At the beginning everything is fine, but after
a while we start getting constant NoHostAvailableException errors from the DataStax driver:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried
for query failed (tried: /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException:
Timeout while trying to acquire available connection (you may want to increase the driver
number of per-host connections)), /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException:
Timeout while trying to acquire available connection (you may want to increase the driver
number of per-host connections)), /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException:
Timeout while trying to acquire available connection (you may want to increase the driver
number of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042,
/172.31.7.244:9042 [only showing errors of first 3 hosts, use getErrors() for more details])
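For reference, if we do end up raising the per-host connection limit the exception mentions, something like this with the Java driver 2.x PoolingOptions is roughly what I have in mind (just a sketch; the contact point and pool sizes are placeholders, not tuned values):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.Session;

public class ClusterSetup {
    public static void main(String[] args) {
        // Raise the per-host connection pool limits the DriverException refers to.
        // The numbers below are only illustrative, not tuned recommendations.
        PoolingOptions pooling = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL, 4)
                .setMaxConnectionsPerHost(HostDistance.LOCAL, 8);

        Cluster cluster = Cluster.builder()
                .addContactPoint("172.31.7.243")  // any reachable node
                .withPoolingOptions(pooling)
                .build();
        Session session = cluster.connect();
        // ... application queries via session ...
        cluster.close();
    }
}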


All the nodes are up and running (nodetool status):

UN  172.31.7.244  152.21 GB  256     14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
UN  172.31.7.245  168.4 GB   256     14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
UN  172.31.7.246  177.71 GB  256     13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
UN  172.31.7.247  158.57 GB  256     14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
UN  172.31.7.243  176.83 GB  256     14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
UN  172.31.7.233  159 GB     256     13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
UN  172.31.7.232  166.05 GB  256     15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1

but two of them have a high CPU load, especially .232, because I am running a lot of deletes
using cqlsh on that node.

I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think it is
normal for all the hosts to become inaccessible.

We have a replication factor of 3, and for the deletes I am not specifying any consistency level
(so it is using the default ONE).
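If it helps, this is roughly how a delete would look with an explicit consistency level set through the Java driver instead of relying on the default (the keyspace/table and key are made-up placeholders):

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ExplicitConsistencyDelete {
    // Issue the delete with an explicit consistency level instead of the default ONE.
    // The keyspace/table and partition key are made-up placeholders.
    static void deleteRow(Session session, String id) {
        SimpleStatement delete = new SimpleStatement(
                "DELETE FROM my_keyspace.my_table WHERE id = ?", id);
        delete.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(delete);
    }
}

(In cqlsh, the session-level CONSISTENCY command, e.g. CONSISTENCY QUORUM, would do the same.)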

I checked the nodes with a lot of CPU (near 96%) and the GC activity remains at around 1.6% (using
only 3 GB of the 10 GB assigned). But looking at the thread pool stats (nodetool tpstats), the pending
column for the MutationStage grows without stopping. Could that be the problem?
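If the deletes are simply arriving faster than the nodes can apply them, one option would be to issue them through the driver with a cap on the number of in-flight requests, roughly like this sketch (the limit of 64 and the keyspace/table/key names are made-up placeholders):

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

import java.util.List;
import java.util.concurrent.Semaphore;

public class ThrottledDeletes {
    // Cap the number of in-flight deletes so writes do not pile up in MutationStage.
    // The limit of 64 and the keyspace/table/key names are made-up placeholders.
    private static final Semaphore IN_FLIGHT = new Semaphore(64);

    static void deleteAll(Session session, List<String> ids) throws InterruptedException {
        for (String id : ids) {
            IN_FLIGHT.acquire();  // blocks once 64 deletes are pending
            SimpleStatement delete = new SimpleStatement(
                    "DELETE FROM my_keyspace.my_table WHERE id = ?", id);
            ResultSetFuture future = session.executeAsync(delete);
            Futures.addCallback(future, new FutureCallback<ResultSet>() {
                @Override public void onSuccess(ResultSet rs) { IN_FLIGHT.release(); }
                @Override public void onFailure(Throwable t) { IN_FLIGHT.release(); }  // log/retry in real code
            });
        }
    }
}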

I cannot find what is causing the timeouts. I have already increased the timeouts, but I do not
think that is a solution, because the timeouts seem to indicate a different kind of problem.
Does anyone have a tip on how to determine where the problem is?

Thanks in advance
