cassandra-user mailing list archives

From Bhuvan Rawal <bhu1ra...@gmail.com>
Subject Re: Node Stuck while restarting
Date Mon, 30 May 2016 12:17:41 GMT
We took a backup of the commit logs and restarted the node, and it started
fine. Since the node had been down for more than a day, we can say for sure
that it was stuck and not processing.
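
For the record, the "hacky measure" was roughly the following (a sketch; the
commitlog path comes from the log excerpt further down the thread, the backup
directory name is my own, and moving the segments aside means any writes not
yet flushed to SSTables are lost from this node):

    # the node was wedged, so a plain stop (or kill) rather than nodetool drain
    sudo service cassandra stop
    mkdir -p /backup/commitlog-20160530
    mv /commitlog/data/CommitLog-6-*.log /backup/commitlog-20160530/
    sudo service cassandra start    # starts clean, without replaying the moved segments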

We are wondering how we can tune our settings to avoid a similar scenario in
the future, ideally without resorting to such a hacky measure.

On Sun, May 29, 2016 at 7:12 PM, Bhuvan Rawal <bhu1rawal@gmail.com> wrote:

> Hi Mike,
>
> Please find below the details you asked for, and some others in case they
> help. We are using the JVM params:
> -Xms8G
> -Xmx8G
>
> MAX_HEAP_SIZE and HEAP_NEWSIZE are not being set explicitly, so they are
> presumably calculated by the calculate_heap_sizes function in
> cassandra-env.sh (i.e. we are using the default calculations). Here are the
> inputs and results; please correct me if I'm wrong:
> system_memory_in_mb : 64544
> system_cpu_cores : 16
>
> for MAX_HEAP_SIZE:
>
>     # set max heap size based on the following
>     # max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB))
>     # calculate 1/2 ram and cap to 1024MB
>     # calculate 1/4 ram and cap to 8192MB
>     # pick the max
>
> By this I can figure out that MAX_HEAP_SIZE is 8 GB: half of 64544 MB is
> capped to 1024 MB, a quarter (16136 MB) is capped to 8192 MB, and the max
> of the two is 8192 MB.
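>
> To double-check, here is a minimal paraphrase of that logic with our numbers
> plugged in (variable names are my shorthand, not a verbatim copy of
> cassandra-env.sh):
>
>     system_memory_in_mb=64544
>     half_ram_in_mb=$((system_memory_in_mb / 2))       # 32272
>     quarter_ram_in_mb=$((system_memory_in_mb / 4))    # 16136
>     [ "$half_ram_in_mb" -gt 1024 ] && half_ram_in_mb=1024           # cap 1/2 ram at 1024 MB
>     [ "$quarter_ram_in_mb" -gt 8192 ] && quarter_ram_in_mb=8192     # cap 1/4 ram at 8192 MB
>     if [ "$half_ram_in_mb" -gt "$quarter_ram_in_mb" ]; then         # pick the max
>         max_heap_size_in_mb=$half_ram_in_mb
>     else
>         max_heap_size_in_mb=$quarter_ram_in_mb
>     fi
>     echo "MAX_HEAP_SIZE=${max_heap_size_in_mb}M"    # prints MAX_HEAP_SIZE=8192M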
>
>     max_sensible_yg_per_core_in_mb="100"
>     max_sensible_yg_in_mb=`expr $max_sensible_yg_per_core_in_mb "*" $system_cpu_cores`    # 100 * 16 = 1600 MB
>     desired_yg_in_mb=`expr $max_heap_size_in_mb / 4`    # 8192 MB / 4 = 2048 MB
>
>     if [ "$desired_yg_in_mb" -gt "$max_sensible_yg_in_mb" ]
>     then
>         HEAP_NEWSIZE="${max_sensible_yg_in_mb}M"
>     else
>         HEAP_NEWSIZE="${desired_yg_in_mb}M"
>     fi
>
> That should set HEAP_NEWSIZE to 1600 MB via the first branch, since the
> desired size (2048 MB) exceeds the sensible maximum (1600 MB).
>
>
> memtable_allocation_type: heap_buffers
>
> memtable_cleanup_threshold - we are using the default:
> # memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1)
> # memtable_cleanup_threshold: 0.11
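>
> A quick sanity check on that formula (my arithmetic, not taken from the
> yaml): with the default 2 flush writers the computed threshold is
> 1 / (2 + 1) ≈ 0.33, i.e. as I understand it a flush of the largest memtable
> is triggered once roughly a third of the memtable space is in use; the
> commented 0.11 would correspond to 8 flush writers (1/9).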
>
> memtable_flush_writers - default (2)
> We could increase this, as we are using SSDs with around 300 IOPS (a sketch
> of explicit overrides follows after the memtable sizes).
>
> memtable_heap_space_in_mb - default values
> # memtable_heap_space_in_mb: 2048
> # memtable_offheap_space_in_mb: 2048
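>
> If we do raise the flush writers, I assume we would set the related
> properties explicitly in cassandra.yaml rather than relying on the commented
> defaults, along these lines (illustrative values, not a tested
> recommendation):
>
>     memtable_flush_writers: 4
>     memtable_cleanup_threshold: 0.2     # 1 / (4 + 1), keeping the default formula consistent
>     memtable_heap_space_in_mb: 2048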
>
> We are using the G1 garbage collector on JDK 1.8.0_45.
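>
> One assumption on my part worth checking: with G1 the usual guidance is to
> set only -Xms/-Xmx plus a pause-time goal and leave HEAP_NEWSIZE/-Xmn unset,
> since pinning the young generation prevents G1 from adapting it to meet the
> pause goal. A sketch of the cassandra-env.sh lines this implies (the 500 ms
> goal is illustrative, not our current setting):
>
>     JVM_OPTS="$JVM_OPTS -Xms8G"
>     JVM_OPTS="$JVM_OPTS -Xmx8G"
>     JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
>     JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"   # pause goal instead of a fixed young gen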
>
> Best Regards,
>
>
> On Sun, May 29, 2016 at 5:07 PM, Mike Yeap <wkk1020@gmail.com> wrote:
>
>> Hi Bhuvan, how big are your current commit logs on the failed node, and
>> what are the sizes of MAX_HEAP_SIZE and HEAP_NEWSIZE?
>>
>> Also, what are the values of the following properties in cassandra.yaml?
>>
>> memtable_allocation_type
>> memtable_cleanup_threshold
>> memtable_flush_writers
>> memtable_heap_space_in_mb
>> memtable_offheap_space_in_mb
>>
>>
>> Regards,
>> Mike Yeap
>>
>>
>>
>> On Sun, May 29, 2016 at 6:18 PM, Bhuvan Rawal <bhu1rawal@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We are running a 6-node cluster across 2 DCs on DSC 3.0.3, with 3 nodes
>>> in each. One of the nodes was showing as UNREACHABLE to the other nodes in
>>> nodetool describecluster, and on that node all the others showed as
>>> UNREACHABLE, so as a measure we restarted the node.
>>>
>>> But on doing that it got stuck, with these messages in system.log:
>>>
>>> DEBUG [SlabPoolCleaner] 2016-05-29 14:07:28,156 ColumnFamilyStore.java:829 - Enqueuing flush of batches: 226784704 (11%) on-heap, 0 (0%) off-heap
>>> DEBUG [main] 2016-05-29 14:07:28,576 CommitLogReplayer.java:415 - Replaying /commitlog/data/CommitLog-6-1464508993391.log (CL version 6, messaging version 10, compression null)
>>> DEBUG [main] 2016-05-29 14:07:28,781 ColumnFamilyStore.java:829 - Enqueuing flush of batches: 207333510 (10%) on-heap, 0 (0%) off-heap
>>>
>>> It is stuck at the MemtablePostFlush / MemtableFlushWriter stages with
>>> pending tasks. This has been their status per nodetool tpstats for a long
>>> time:
>>>
>>> MemtablePostFlush      Active: 1    Pending: 52    Completed: 16
>>> MemtableFlushWriter    Active: 2    Pending: 13    Completed: 15
>>>
>>>
>>> We restarted the node again with the log level set to TRACE, but in vain.
>>> What could be a possible contingency plan in such a scenario?
>>>
>>> Best Regards,
>>> Bhuvan
>>>
>>>
>>
>
