You're seeing dropped mutations reported by nodetool tpstats?
Take a look at the logs. Look for messages from the MessagingService with the pattern
"{} {} messages dropped in last {}ms". They will be followed by info about the TP stats.
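
If it helps, here is a rough sketch of pulling those messages out of a log and totalling them per message type. It assumes the three placeholders render as a count, a verb name and an interval in milliseconds; the function name and the sample lines are made up for illustration, so adjust them to what your logs actually contain.

```python
import re
from collections import Counter

# Assumed rendering of "{} {} messages dropped in last {}ms":
# <count> <verb> messages dropped in last <interval>ms
DROPPED = re.compile(r"(\d+) (\w+) messages dropped in last (\d+)ms")


def dropped_by_verb(lines):
    """Sum the dropped-message counts per verb across the given log lines."""
    totals = Counter()
    for line in lines:
        match = DROPPED.search(line)
        if match:
            count, verb, _interval_ms = match.groups()
            totals[verb] += int(count)
    return totals


if __name__ == "__main__":
    # Hypothetical message text, not copied from a real log.
    sample = [
        "12 MUTATION messages dropped in last 5000ms",
        "3 MUTATION messages dropped in last 5000ms",
    ]
    for verb, total in dropped_by_verb(sample).items():
        print(verb, total)
```

To run it against a real log, pass the open file object to dropped_by_verb(), since it only needs an iterable of lines.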
The first thing I would check is the workload. Are you sending very big batch_mutate or
multiget requests? Each row in a request turns into a command in the appropriate thread pool.
This can result in other requests waiting a long time for their commands to get processed.
Next I would look for GC activity and check that memtable_flush_queue_size is set high enough
(see the comments in the yaml for docs).
After that I would look at winding concurrent_writes (and I assume concurrent_reads) back.
Any time I see weirdness I look for config changes and see what happens when they are returned
to the default or near default. Do you have 16 _physical_ cores?
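
A quick way to spot settings that have drifted is to diff the yaml against the stock values. This is only a sketch: the defaults in the table are my assumption of what ships in cassandra.yaml, so confirm them against the file that came with your version, and the helper name is made up. It only understands flat "key: value" lines, which is enough for these three settings.

```python
# Assumed stock values; check them against the cassandra.yaml for your version.
ASSUMED_DEFAULTS = {
    "concurrent_reads": 32,
    "concurrent_writes": 32,
    "memtable_flush_queue_size": 4,
}


def changed_settings(yaml_text, defaults=ASSUMED_DEFAULTS):
    """Return {name: (current, default)} for tracked settings that differ."""
    changed = {}
    for raw in yaml_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key in defaults and value.isdigit() and int(value) != defaults[key]:
            changed[key] = (int(value), defaults[key])
    return changed
```

Feeding it the text of a yaml with concurrent_writes: 100 would report that setting with its current and assumed default value, which gives you a short list of things to wind back one at a time.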
Hope that helps.
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
On 18/08/2012, at 10:01 AM, Guillermo Winkler <gwinkler@inconcertcc.com> wrote:
> Aaron, thanks for your answer.
>
> I'm actually tracking a problem where mutations get dropped and cfstats shows no activity
> whatsoever. I have 100 threads for the mutation pool and no running or pending tasks, but some
> mutations get dropped nonetheless.
>
> I'm thinking about some scheduling problems but not really sure yet.
>
> Have you ever seen a case of dropped mutations with the system under light load?
>
> Thanks,
> Guille
>
>
> On Thu, Aug 16, 2012 at 8:22 PM, aaron morton <aaron@thelastpickle.com> wrote:
> That's some pretty old code. I would guess it was done that way to conserve resources.
> And _I think_ thread creation is pretty lightweight.
>
> Jonathan / Brandon / others - opinions ?
>
> Cheers
>
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2012, at 8:09 AM, Guillermo Winkler <gwinkler@inconcertcc.com> wrote:
>
>> Hi, I have a Cassandra cluster where I'm seeing a lot of thread thrashing from the
>> mutation pool.
>>
>> MutationStage:72031
>>
>> Where threads get created and disposed of in 100's batches every few minutes. Since
>> it's a 16-core server, concurrent_writes is set to 100 in cassandra.yaml.
>>
>> concurrent_writes: 100
>>
>> I've seen in the StageManager class that these pools get created with a 60-second
>> keepalive time.
>>
>> DebuggableThreadPoolExecutor -> allowCoreThreadTimeOut(true);
>>
>> StageManager -> public static final long KEEPALIVE = 60; // seconds to keep "extra" threads alive for when idle
>>
>> Is there a reason for it to be this way?
>>
>> Why not have a fixed-size pool with Integer.MAX_VALUE as the keepalive, since corePoolSize
>> and maxPoolSize are set to the same size?
>>
>> Thanks,
>> Guille
>>
>
>