cassandra-user mailing list archives

From Jeff Jirsa <jeff.ji...@crowdstrike.com>
Subject Re: Read operations freeze for a few second while adding a new node
Date Thu, 28 Jan 2016 17:17:55 GMT
Is this during streaming plan setup (does your 10-20 second window of impact start approximately
30 seconds after you start the node that’s joining the ring), or does it happen for the entire
time you’re joining the node to the ring?

If it is during plan setup, there’s a chance it’s GC related: the streaming plan code used to
instantiate ALL of the compression metadata chunks in order to do its calculations, which creates
a fair amount of garbage and, in turn, GC activity. https://issues.apache.org/jira/browse/CASSANDRA-10680
was created due to some edge cases (very small compression chunk size + 3T of data per node
= hundreds of millions of objects), but it’s possible that you’re seeing a less extreme
version of that same behavior.
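
To get a feel for the object counts involved, here is a rough back-of-the-envelope sketch. It assumes one metadata object per compression chunk and treats the per-node data size as the amount being divided into chunks, which is a simplification. The 4 KB and 64 KB chunk lengths are assumptions for illustration; neither message states the actual chunk size.

```python
# Rough estimate of how many compression chunks (and hence metadata
# objects, assuming one object per chunk) a node's data breaks into.
# The chunk lengths used below are assumed values, not taken from the thread.

GIB = 1024 ** 3
TIB = 1024 ** 4


def chunk_count(data_bytes: int, chunk_length_kb: int) -> int:
    """Number of chunks needed to cover data_bytes (rounded up)."""
    chunk_bytes = chunk_length_kb * 1024
    return -(-data_bytes // chunk_bytes)  # ceiling division


# Edge case from the ticket description: 3T per node, very small chunks
# (4 KB is an assumed value for "very small").
edge_case = chunk_count(3 * TIB, 4)

# The cluster in this thread: ~120 GB per node, with an assumed 64 KB chunk length.
this_cluster = chunk_count(120 * GIB, 64)

print(f"3 TiB at 4 KB chunks:    {edge_case:,} chunks")
print(f"120 GiB at 64 KB chunks: {this_cluster:,} chunks")
```

Under these assumptions the edge case lands in the hundreds of millions of chunks, while a ~120 GB node lands in the low millions, which fits the idea of a less extreme version of the same behavior.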



From:  Lorand Kasler
Reply-To:  "user@cassandra.apache.org"
Date:  Thursday, January 28, 2016 at 8:11 AM
To:  "user@cassandra.apache.org"
Subject:  Read operations freeze for a few second while adding a new node

Hi, 

We are struggling with a problem: when adding nodes, around 5% of read operations freeze (i.e.
time out after 1 second) for a few seconds (10-20 seconds). It might not seem like much, but at
around 200k requests per second that's quite a big disruption.  It is well documented
and known that adding nodes *has* an impact on the latency or the completion of requests,
but is there a way to lessen that? 
It is completely okay for write operations to fail or get blocked while adding nodes, but
having the read path also impacted this much (going from a 30 millisecond 99th percentile latency
to above 1 second) is what puzzles us.

We have a 36-node cluster, with every node owning ~120 GB of data. We are using Cassandra version
2.0.14 with vnodes and we are in the process of increasing the capacity of the cluster by roughly
doubling the number of nodes.  The nodes have SSDs and a peak IO usage of ~30%. 

Apart from the latency metrics, the only thing we see is that FlushWriter is blocked 18% of the
time (based on the tpstats counters), but that should only lead to blocked writes and not reads? 

Thank you 

