cassandra-user mailing list archives

From Avinash Mandava <avin...@vorstella.com>
Subject Re: Bursts of Thrift threads make cluster unresponsive
Date Thu, 27 Jun 2019 20:40:04 GMT
Yeah, I skimmed too fast. Don't add more work if CPU is pegged, and if
you're using the Thrift protocol, the Native-Transport-Requests (NTR) pool
would not have values.
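
For what it's worth, a quick way to spot queue saturation is to parse
"nodetool tpstats" and flag pools with pending or blocked work. A minimal
sketch in Python, assuming nodetool is on the PATH and the usual
Active/Pending/Completed/Blocked/All-time-blocked column layout (adjust
the parsing if your version prints different columns):

    import subprocess

    # Report thread pools that currently have pending or blocked tasks.
    out = subprocess.run(["nodetool", "tpstats"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        parts = line.split()
        # Pool rows end in five numeric columns; the header and the
        # dropped-message section at the bottom don't match this shape.
        if len(parts) >= 6 and parts[-1].isdigit():
            name = " ".join(parts[:-5])
            active, pending, _completed, blocked, all_time = parts[-5:]
            if (pending, blocked, all_time) != ("0", "0", "0"):
                print(name, "active:", active, "pending:", pending,
                      "blocked:", blocked, "all-time blocked:", all_time)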

Is there an order in which the events you described happened, or is the
order in which you presented them the order you noticed things going
wrong?

On Thu, Jun 27, 2019 at 1:29 PM Dmitry Simonov <dimmoborgir@gmail.com>
wrote:

> Thanks for your reply!
>
> > Have you tried increasing concurrent reads until you see more activity
> > on disk?
> When the problem occurs, the freshly created 1.2k - 2k Thrift threads
> consume all CPU on all cores.
> Could increasing concurrent reads help in this situation?
>
> >
> org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
> This metric is 0 on all cluster nodes.
>
> On Fri, Jun 28, 2019 at 12:34 AM Avinash Mandava <avinash@vorstella.com>
> wrote:
>
>> Have you tried increasing concurrent reads (the concurrent_reads setting
>> in cassandra.yaml) until you see more activity on disk? If you've always
>> got 32 active reads and high pending reads, it could just be dropping the
>> reads because the queues are saturated. It could be artificially
>> bottlenecking at the C* process level.
>>
>> Also what does this metric show over time:
>>
>>
>> org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
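>>
>> If it helps, that counter can also be read directly over JMX. Below is a
>> minimal sketch in Python that shells out to the jmxterm uber-jar (the jar
>> path and the default 7199 JMX port are assumptions; adjust for your
>> setup):
>>
>>     import subprocess
>>
>>     # Hypothetical location of the jmxterm uber-jar.
>>     JMXTERM = "jmxterm-uber.jar"
>>     BEAN = ("org.apache.cassandra.metrics:type=ThreadPools,path=transport,"
>>             "scope=Native-Transport-Requests,name=TotalBlockedTasks")
>>
>>     # Feed one 'get' command to jmxterm in non-interactive mode (-n).
>>     result = subprocess.run(
>>         ["java", "-jar", JMXTERM, "-l", "localhost:7199", "-n"],
>>         input="get -b " + BEAN + " Count\n",
>>         capture_output=True, text=True)
>>     print(result.stdout.strip())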
>>
>>
>>
>> On Thu, Jun 27, 2019 at 1:52 AM Dmitry Simonov <dimmoborgir@gmail.com>
>> wrote:
>>
>>> Hello!
>>>
>>> We've run into the following problem several times.
>>>
>>> Cassandra cluster (5 nodes) becomes unresponsive for ~30 minutes:
>>> - all CPUs have 100% load (normally we have a load average of ~5 on
>>> 16-core machines)
>>> - cassandra's thread count rises from 300 to 1300 - 2000; most of them
>>> are Thrift threads in java.net.SocketInputStream.socketRead0(Native
>>> Method), while the count of other threads doesn't increase (see the
>>> sketch after this list)
>>> - some Read messages are dropped
>>> - read latency (p99.9) increases to 20-30 seconds
>>> - there are up to 32 active Read Tasks and up to 3k - 6k pending Read
>>> Tasks
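>>>
>>> In case someone wants to reproduce the thread counting, here is a
>>> minimal sketch (assumes the JDK's jstack on the PATH and the Cassandra
>>> pid as the first argument):
>>>
>>>     import subprocess
>>>     import sys
>>>
>>>     # Take a thread dump of the Cassandra JVM and count the threads
>>>     # that are currently inside socketRead0.
>>>     dump = subprocess.run(["jstack", sys.argv[1]],
>>>                           capture_output=True, text=True,
>>>                           check=True).stdout
>>>     # jstack separates threads with blank lines; the thread name is on
>>>     # the first line of each block.
>>>     stuck = [block.split("\n", 1)[0] for block in dump.split("\n\n")
>>>              if "java.net.SocketInputStream.socketRead0" in block]
>>>     print(len(stuck), "threads in socketRead0")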
>>>
>>> Problem starts synchronously on all nodes of the cluster.
>>> I cannot tie this problem to increased load from clients (the "read
>>> rate" doesn't increase during the problem).
>>> Also, it looks like there is no problem with the disks (I/O latencies
>>> are OK).
>>>
>>> Could anybody please give some advice in further troubleshooting?
>>>
>>> --
>>> Best Regards,
>>> Dmitry Simonov
>>>
>>
>>
>> --
>> www.vorstella.com
>> 408 691 8402
>>
>
>
> --
> Best Regards,
> Dmitry Simonov
>


-- 
www.vorstella.com
408 691 8402
