incubator-cassandra-user mailing list archives

From Alain RODRIGUEZ <arodr...@gmail.com>
Subject Re: restarting node makes cpu load of the entire cluster to raise
Date Wed, 18 Jun 2014 15:23:55 GMT
Here is more info around ParNew. In these screenshots we can see 2 groups
of servers.

The servers with the lower GC counts and GC time use a New size of 1600M, and
the ones with the higher values use 400M (set in cassandra-env.sh).
Every spike corresponds to a server coming back into the ring.
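
For reference, a sketch of the cassandra-env.sh lines that distinguish the two
groups. Only the New size values (1600M and 400M) come from this thread; the
MAX_HEAP_SIZE value is an assumption, guessed from the "max is 8422162432"
figure in the log line quoted further down.

```shell
# cassandra-env.sh (fragment)
# MAX_HEAP_SIZE is an assumed value; the thread only states the New size.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="1600M"   # the other group of servers uses "400M"
```

cassandra-env.sh expects these two variables to be set (or left unset) as a
pair, so changing the New size also means pinning the max heap explicitly.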

Hope that this extra info will help.


2014-06-18 14:43 GMT+02:00 Alain RODRIGUEZ <arodrime@gmail.com>:

> This last command was supposed to be a best practice a few years ago, hope
> it is still the case. I just added the recent "nodetool disablebinary"
> part...
>
>
> 2014-06-18 14:36 GMT+02:00 Alain RODRIGUEZ <arodrime@gmail.com>:
>
>> Thanks a lot for taking the time to check the log.
>>
>> We just switched from 400M to 1600M NEW size in the cassandra-env.sh. It
>> reduced our latency and the PARNEW GC time / second significantly...
>> (described here
>> http://tech.shift.com/post/74311817513/cassandra-tuning-the-jvm-for-read-heavy-workloads
>> )
>>
>> Even when we had 400M the restart was behaving this way.
>>
>> We stop the node using: nodetool disablegossip && nodetool disablethrift
>> && nodetool disablebinary && sleep 10 && nodetool drain && sleep 30 &&
>> service cassandra stop
>>
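
The stop sequence above could be wrapped in a small script along these lines.
This is only a sketch: the nodetool subcommands and sleep durations are the
ones from the message, while the `run` helper, the `stop_node` function name
and the `DRY_RUN` variable are hypothetical additions for illustration and are
not Cassandra or nodetool options.

```shell
#!/bin/sh
# Sketch of the stop sequence quoted above.
# DRY_RUN, run and stop_node are illustrative names, not part of Cassandra.

run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"    # dry run: only print the command
    else
        "$@"         # execute the command as given
    fi
}

stop_node() {
    # The && chain stops at the first step that fails.
    run nodetool disablegossip &&
    run nodetool disablethrift &&
    run nodetool disablebinary &&
    run sleep 10 &&
    run nodetool drain &&
    run sleep 30 &&
    run service cassandra stop
}
```

Calling `DRY_RUN=1 stop_node` prints the seven commands without running them;
calling `stop_node` alone executes them in order.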
>>
>> 2014-06-18 14:23 GMT+02:00 Jonathan Lacefield <jlacefield@datastax.com>:
>>
>>> There are several long ParNew pauses that were recorded during startup.
>>>  The young gen size looks large too, if I am reading that line correctly.
>>>  Did you happen to override the default settings for MAX_HEAP and/or NEW
>>> size in the cassandra-env.sh?  The large young gen size, set via the env.sh
>>> file, could be causing longer than typical pauses, which could make your
>>> node appear to be unresponsive and have high CPU (CPU for the ParNew GC
>>> event).
>>>
>>> Check out this one - INFO 11:42:51,939 GC for ParNew: 2148 ms for 2
>>> collections, 1256307568 used; max is 8422162432
>>> That is a 2 second GC pause.  That's very high for ParNew.  We typically
>>> want a lot of tiny ParNew events as opposed to large, and less frequent,
>>> ParNew events.
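
The log line quoted above appears to report a total over the listed number of
collections, so it can help to look at the average per collection as well. A
small awk sketch, assuming the abbreviated line format shown in this thread
(real system.log lines carry extra fields, which this tolerates by searching
for the "ParNew:" token); `gc_pauses` is a hypothetical helper name:

```shell
# Print total and average ParNew time for each "GC for ParNew" log line.
# Reads the files given as arguments, or stdin when none are given.
gc_pauses() {
    awk '/GC for ParNew:/ {
        ms = 0; n = 0
        for (i = 1; i + 4 <= NF; i++) {
            # Layout: "ParNew: <ms> ms for <n> collections,"
            if ($i == "ParNew:") { ms = $(i + 1); n = $(i + 4) }
        }
        if (n > 0) {
            printf "%d ms total, %d collections, %.0f ms avg\n", ms, n, ms / n
        }
    }' "$@"
}
```

For the line above this gives 2148 ms total over 2 collections, about 1074 ms
per collection on average. It could be run as `gc_pauses system.log`, where
the log path depends on the installation.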
>>>
>>> One other thing that was noticed was that the node had a lot of log
>>> segment replays during startup.  You could avoid these, or minimize them,
>>> by performing a flush or drain before stopping and starting Cassandra.
>>>  This will flush memtables and clear your log segments.
>>>
>>>
>>>
>>> Jonathan Lacefield
>>> Solutions Architect, DataStax
>>> (404) 822 3487
>>>
>>>
>>>
>>> On Wed, Jun 18, 2014 at 8:05 AM, Alain RODRIGUEZ <arodrime@gmail.com>
>>> wrote:
>>>
>>>> A simple restart of a node with no changes gives this result.
>>>>
>>>> logs output : https://gist.github.com/arodrime/db9ab152071d1ad39f26
>>>>
>>>> Here are some screenshots:
>>>>
>>>> - htop from a node immediately after restarting
>>>> - opscenter ring view (shows CPU load on all nodes)
>>>> - opscenter dashboard showing the impact of a restart on latency (it can
>>>> affect writes or reads, it depends; the reaction seems to be quite random)
>>>>
>>>>
>>>> 2014-06-18 13:35 GMT+02:00 Jonathan Lacefield <jlacefield@datastax.com>
>>>> :
>>>>
>>>>> Hello
>>>>>
>>>>>   Have you checked the log file to see what's happening during startup?
>>>>> What caused the rolling restart?  Did you perform an upgrade or
>>>>> change a config?
>>>>>
>>>>> > On Jun 18, 2014, at 5:40 AM, Alain RODRIGUEZ <arodrime@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Hi guys
>>>>> >
>>>>> > Using 1.2.11, when I try to do a rolling restart of the cluster, any
>>>>> > node I restart makes the CPU load of the whole cluster increase,
>>>>> > reaching a "red" state in opscenter (load from 3-4 to 20+). This
>>>>> > happens once the node is back online.
>>>>> >
>>>>> > The restarted node uses 100% CPU for 5 - 10 min and sometimes drops
>>>>> > mutations.
>>>>> >
>>>>> > I have tried to throttle handoff to 256 (instead of 1024), yet it
>>>>> > doesn't seem to help that much.
>>>>> >
>>>>> > Disks are not the bottleneck. PARNEW GC increases a bit, but nothing
>>>>> > problematic I think.
>>>>> >
>>>>> > Basically, what could be happening on node restart? What is taking
>>>>> > that much CPU on every machine? There is no steal or iowait.
>>>>> >
>>>>> > What can I try to tune?
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>
