cassandra-user mailing list archives

From Ryan Svihla
Subject Re: Guidelines for configuring Thresholds for Cassandra metrics
Date Fri, 26 Aug 2016 12:31:17 GMT

Forgot the most important thing: logs.
* ERROR - you should investigate.
* WARN - you should have a list of known ones. Use case dependent. Ideally you change
configuration accordingly.
* *PoolCleaner (slab or native) - a good indication the node is tuned badly if you see a ton
of this. Set memtable_cleanup_threshold to 0.6 as an initial attempt to configure this
correctly. This is a complex topic to dive into, so that may not be the best number, but it
will likely be better than the default; why it's not the default is a big conversation.
There are a bunch of other logs I look for that are escaping me at present, but that's a good
start.
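
As a rough illustration of the log checks above, here is a minimal sketch that counts ERROR
and WARN lines plus lines logged from a *PoolCleaner thread. The line shape
"LEVEL [thread] timestamp source - message" and the sample lines are assumptions made for
the example, so adjust the pattern to whatever your logging configuration actually emits.

```python
import re
from collections import Counter

# Assumed line shape: "<LEVEL>  [<thread>] <timestamp> <source> - <message>".
LINE_RE = re.compile(r"^(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]")


def summarize_log(lines):
    """Count ERROR/WARN lines and lines logged from *PoolCleaner threads."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # continuation lines such as stack traces
        level = match.group("level")
        if level in ("ERROR", "WARN"):
            counts[level] += 1
        if "PoolCleaner" in match.group("thread"):
            counts["PoolCleaner"] += 1
    return counts


# Made-up sample lines, only to exercise the function.
sample = [
    "INFO  [SlabPoolCleaner] 2016-08-26 12:00:01,000 Example.java:1 - flushing",
    "WARN  [main] 2016-08-26 12:00:02,000 Example.java:2 - something odd",
    "ERROR [main] 2016-08-26 12:00:03,000 Example.java:3 - something broke",
    "    at example.StackFrame(Example.java:3)",
    "INFO  [NativePoolCleaner] 2016-08-26 12:00:04,000 Example.java:4 - flushing",
]
summary = summarize_log(sample)
print(dict(summary))
```

What counts as "a ton" of PoolCleaner lines is left to the reader; the counts are only a
starting point for comparison between nodes or over time.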
Ryan Svihla

On Fri, Aug 26, 2016 at 7:21 AM -0500, "Ryan Svihla" wrote:

Not all metrics are KPIs; many are only useful when researching a specific issue or after a
use-case-specific threshold has been set.
The main "canaries" I monitor are:
* Pending compactions (dependent on the compaction strategy chosen, but 1000 is a sign of
severe issues in all cases).
* Dropped mutations (more than one I treat as an event to investigate; I believe in allowing
operational overhead, and any evidence of load shedding suggests I may not have as much as I
thought).
* Blocked anything (flush writers, etc.; more than one I investigate).
* System hints (more than 1k I investigate).
* Heap usage and GC time vary a lot by use case and collector chosen. I aim for below 65%
usage as an average with G1, but this again varies by use case a great deal. Sometimes I just
look at the chart and the query patterns, and if they don't line up I have to do other,
deeper investigations.
* Read and write latencies exceeding SLA are also use case dependent. Those that have no SLA
I tend to push towards a p99 of 100ms on a middle-end SSD-based system and 600ms on a
spindle-based system, with CL ONE and assuming a "typical" query pattern (again, query
patterns and CL also vary here).
* Cell count and partition size vary greatly by hardware and GC tuning, but in the absence of
all other relevant information I like to keep cell count for a partition below 100k and size
below 100MB. I have, however, many successful use cases running more, and I've had some fail
well before that. Hardware and tuning tradeoffs shift this around a lot.
There is unfortunately, as you'll note, a lot of nuance, and the loadout really changes what
looks right (down to the model of SSDs: I have different expectations for p99s, and if it's a
model I haven't used before I'll do some comparative testing).
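
To make the list above concrete, here is a minimal sketch that checks one snapshot of
metrics against those rule-of-thumb numbers. The metric key names and the snapshot dict are
hypothetical, not names exposed by Cassandra, and the limits are the general starting points
from this message rather than values tuned for any particular cluster.

```python
# Starting-point limits from the list above; all key names are hypothetical.
FLAG_ABOVE = {            # flagged when the value is strictly greater
    "dropped_mutations": 1,
    "blocked_tasks": 1,
    "hints": 1000,
}
FLAG_AT_OR_ABOVE = {      # flagged when the value reaches the limit
    "pending_compactions": 1000,
    "heap_used_pct_avg": 65,
    "max_partition_cells": 100_000,
    "max_partition_mb": 100,
}
P99_TARGET_MS = {"ssd": 100, "spindle": 600}  # CL ONE, "typical" queries


def canaries(snapshot, storage="ssd"):
    """Return the sorted names of metrics in `snapshot` worth investigating."""
    above = dict(FLAG_ABOVE, p99_latency_ms=P99_TARGET_MS[storage])
    flagged = [n for n, lim in above.items() if snapshot.get(n, 0) > lim]
    flagged += [
        n for n, lim in FLAG_AT_OR_ABOVE.items() if snapshot.get(n, 0) >= lim
    ]
    return sorted(flagged)


# Made-up snapshot for illustration.
snapshot = {
    "pending_compactions": 1000,
    "dropped_mutations": 1,
    "hints": 1500,
    "p99_latency_ms": 250,
}
on_ssd = canaries(snapshot, "ssd")
on_spindle = canaries(snapshot, "spindle")
print(on_ssd)
print(on_spindle)
```

Keeping the limits in plain dicts makes it easy to override them per cluster once a
use-case-specific baseline exists.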
The reason so much of this is general and vague is my selection bias. I'm brought in when
people are complaining about performance or some grand systemic crash because they were monitoring
nothing. I have little ability to change hardware initially, so I have to be willing to allow
the hardware to do the best it can and establish levels where it can no longer keep up with
the customer's goals. This may mean that for some use cases 10 pending compactions is an
actionable event, while for another customer 100 is. The better approach is to establish a baseline
for when these metrics start to indicate a serious issue is occurring in that particular app.
Basically when people notice a problem, what did these numbers look like in the minutes, hours
and days prior? That's the way to establish the levels consistently.
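
The "what did the numbers look like beforehand" question can be sketched as follows: given
timestamped samples of one metric and the time a problem was noticed, summarise the metric
in several lookback windows. The function name, the window labels, and the sample data are
all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def lookback(samples, noticed_at, windows):
    """Summarise one metric in each window that ends at `noticed_at`.

    samples: list of (datetime, value) pairs for a single metric.
    windows: mapping of label -> timedelta (how far back to look).
    """
    summary = {}
    for label, span in windows.items():
        values = [v for t, v in samples if noticed_at - span <= t < noticed_at]
        if values:  # skip windows with no data rather than guessing
            summary[label] = {"median": median(values), "max": max(values)}
    return summary


# Made-up pending-compaction samples leading up to a reported problem.
noticed = datetime(2016, 8, 26, 12, 0)
samples = [
    (noticed - timedelta(hours=30), 2),
    (noticed - timedelta(hours=5), 8),
    (noticed - timedelta(minutes=50), 20),
    (noticed - timedelta(minutes=10), 40),
    (noticed - timedelta(minutes=2), 90),
]
windows = {
    "15 minutes": timedelta(minutes=15),
    "1 hour": timedelta(hours=1),
    "2 days": timedelta(days=2),
}
baseline = lookback(samples, noticed, windows)
for label, stats in baseline.items():
    print(label, stats)
```

Repeating this over several incidents gives a per-application picture of the levels at which
a metric starts to precede trouble, which is more useful than any general number.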
Ryan Svihla

On Fri, Aug 26, 2016 at 4:48 AM -0500, "Thomas Julian" wrote:


I am working on setting up a monitoring tool to monitor Cassandra instances. Are there any
wikis which specify optimum values for each Cassandra KPI?
For instance, I am not sure
what value of "Memtable Columns Count" can be considered "Normal", or
what value of the same has to be considered "Critical".
I know threshold numbers for a few params; for instance, anything more than zero for timeouts
or pending tasks should be considered unusual. Also, I am aware that most of the statistics'
threshold numbers vary in accordance with hardware specification and Cassandra environment setup.
But what I request here is a general guideline for configuring thresholds for all the metrics.

If this has already been covered, please point me to that resource. If anyone has collected
these things out of their own interest, please share.

Any help is appreciated.

Best Regards,
