cassandra-user mailing list archives

From Avi Kivity <...@scylladb.com>
Subject Re: CS process killed by kernel OOM
Date Mon, 06 Feb 2017 13:35:00 GMT
It is a bug.  In some contexts, the kernel needs to be able to reclaim 
memory instantly, but this is not one of them.  Here, the java process 
is creating a new thread, and the kernel is allocating 16kB for its 
kernel stack; that is a regular allocation, not atomic. If you decode 
the gfp_mask value you'll see that the kernel is allowed to initiate I/O 
and perform filesystem operations to satisfy the allocation, which it 
apparently did not.
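
A minimal sketch of what decoding a gfp_mask looks like. The bit values are an assumption, recalled from include/linux/gfp.h in the 4.4 series; they change between kernel releases, so check them against the source of the kernel you run. The example mask is hypothetical and is not copied from the kern.log in this thread.

```python
# Assumed bit values (4.4-era include/linux/gfp.h); verify for your kernel.
# Only a subset of flags is listed here.
GFP_FLAGS_4_4 = {
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
    0x80000: "__GFP_ATOMIC",
    0x200000: "__GFP_NOTRACK",
    0x400000: "__GFP_DIRECT_RECLAIM",
    0x2000000: "__GFP_KSWAPD_RECLAIM",
}


def decode_gfp_mask(mask):
    """Return (names of known flags set in mask, bits not in the table)."""
    names = [name for bit, name in sorted(GFP_FLAGS_4_4.items()) if mask & bit]
    known_bits = 0
    for bit in GFP_FLAGS_4_4:
        known_bits |= bit
    return names, mask & ~known_bits


# Hypothetical mask for illustration only.
example_mask = 0x26000C0
names, unknown = decode_gfp_mask(example_mask)
print(names, hex(unknown))
```

If __GFP_IO, __GFP_FS and __GFP_DIRECT_RECLAIM show up and __GFP_ATOMIC does not, the allocation was allowed to block, start I/O and call into the filesystem to reclaim memory.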


I do recommend reporting it; it will help others avoid encountering the 
same problem if it gets fixed.


On 02/06/2017 03:07 PM, Benjamin Roth wrote:
> Thanks for the reply. We got rid of the OOMs by increasing 
> vm.min_free_kbytes, its default of approx 90mb is maybe a bit low for 
> systems with 128GB.
> I guess the OOM happens because the kernel could not reclaim enough 
> paged memory instantly.
> I can't tell if this is really a kernel bug or not. It also was my 
> first thought but in the end the main thing is, it works again and it 
> does with more min_free_kbytes
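
To put the numbers in proportion, here is a rough sketch that expresses min_free_kbytes as a fraction of total RAM. The 90 MB and 128 GB inputs are the approximate figures quoted above, not measured values, and the helper names are made up for illustration. The reader function assumes a Linux host and is not called here.

```python
def read_min_free_kbytes(path="/proc/sys/vm/min_free_kbytes"):
    """Read the current setting on a Linux host (not called in this sketch)."""
    with open(path) as f:
        return int(f.read().strip())


def min_free_fraction(min_free_kb, mem_total_kb):
    """Fraction of total memory the kernel tries to keep free."""
    return min_free_kb / mem_total_kb


# Approximate figures from the message: ~90 MB default, nominal 128 GB of RAM.
approx_default_kb = 90 * 1024
nominal_total_kb = 128 * 1024 * 1024

fraction = min_free_fraction(approx_default_kb, nominal_total_kb)
print(f"{fraction:.4%} of RAM kept free")
```

With these inputs the reserve is well under a tenth of a percent of RAM.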
>
> 2017-02-06 11:53 GMT+01:00 Avi Kivity <avi@scylladb.com>:
>
>
>     On 01/26/2017 07:36 AM, Benjamin Roth wrote:
>>     Hi there,
>>
>>     We installed 2 new nodes these days. They run on ubuntu (Ubuntu
>>     16.04.1 LTS) with kernel 4.4.0-59-generic. On these nodes (and
>>     only on these) CS gets killed by the kernel due to OOM. It seems
>>     very strange to me because, CS only takes roughly 20GB (out of
>>     128GB), most of RAM is allocated to page cache.
>>
>>     Top looks typically like this:
>>     KiB Mem : 13191691+total,  1974964 free, 20278184 used,
>>     10966376+buff/cache
>>     KiB Swap:        0 total,        0 free,    0 used.
>>     11051503+avail Mem
>>
>>     This is what kern.log says:
>>     https://gist.github.com/brstgt/0f1aa6afb558a56d1cadce958db46cf9
>>
>>     Has anyone encountered something like this before?
>>
>
>     2017-01-26T03:10:45.679458+00:00 cas10 kernel: [52226.449989] Node
>     0 Normal: 33850*4kB (UMEH) 8*8kB (UMH) 1*16kB (H) 0*32kB 0*64kB
>     0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 135480kB
>     2017-01-26T03:10:45.679460+00:00 cas10 kernel: [52226.449995] Node
>     1 Normal: 34213*4kB (UME) 176*8kB (UME) 0*16kB 0*32kB 0*64kB
>     0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 138260kB
>
>
>     There is plenty of free memory left (33850+34213)*4kB = 270 MB,
>     but it is fragmented into 4k and 8k blocks, while the kernel is
>     trying to allocate 16kB.  Still, the kernel could have evicted
>     some page cache or swapped out anonymous memory.  You should
>     report this to lkml, it is a kernel bug.
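
The arithmetic above can be checked with a small sketch that parses the two free-list lines quoted from kern.log and sums the free memory per node, both in total and in blocks of at least 16kB. The function names are made up for illustration; the input strings are the log lines quoted above with the timestamp prefix dropped.

```python
import re


def parse_buddy_line(line):
    """Map block size in kB to the number of free blocks of that size."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def free_kb(blocks, min_block_kb=0):
    """Total free kB in blocks that are at least min_block_kb large."""
    return sum(size * count
               for size, count in blocks.items()
               if size >= min_block_kb)


node0 = parse_buddy_line(
    "Node 0 Normal: 33850*4kB (UMEH) 8*8kB (UMH) 1*16kB (H) 0*32kB 0*64kB "
    "0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 135480kB")
node1 = parse_buddy_line(
    "Node 1 Normal: 34213*4kB (UME) 176*8kB (UME) 0*16kB 0*32kB 0*64kB "
    "0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 138260kB")

for name, blocks in (("node0", node0), ("node1", node1)):
    print(name, free_kb(blocks), "kB free,",
          free_kb(blocks, min_block_kb=16), "kB in blocks >= 16kB")
```

Almost all of the free memory sits in 4kB and 8kB blocks: node 0 has a single 16kB block and node 1 has none.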
>
>
>
>>     -- 
>>     Benjamin Roth
>>     Prokurist
>>
>>     Jaumo GmbH · www.jaumo.com
>>     Wehrstraße 46 · 73035 Göppingen · Germany
>>     Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>>     AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

