From: Al Tobey <al@ooyala.com>
To: user@cassandra.apache.org
Date: Wed, 13 Jun 2012 20:42:38 -0700
Subject: Re: kswapd0 causing read timeouts

I would check /etc/sysctl.conf and get the values of /proc/sys/vm/swappiness and /proc/sys/vm/vfs_cache_pressure.

If you don't have JNA enabled (which Cassandra uses to fadvise) and swappiness is at its default of 60, the Linux kernel will happily swap out your heap for cache space. Set swappiness to 1 or 'swapoff -a' and kswapd shouldn't be doing much unless you have a too-large heap or some other app using up memory on the system.
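A minimal sketch of those checks, assuming a stock Linux box where settings are applied via sysctl (the persistence step in /etc/sysctl.conf is illustrative and may differ by distro):

    # current values (vm.swappiness defaults to 60, vm.vfs_cache_pressure to 100)
    cat /proc/sys/vm/swappiness
    cat /proc/sys/vm/vfs_cache_pressure

    # either disable swap entirely...
    swapoff -a

    # ...or tell the kernel to strongly prefer keeping anonymous pages
    # (the JVM heap) resident rather than growing the page cache
    sysctl -w vm.swappiness=1

    # persist across reboots
    echo 'vm.swappiness = 1' >> /etc/sysctl.conf
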
On Wed, Jun 13, 2012 at 11:30 AM, ruslan usifov <ruslan.usifov@gmail.com> wrote:
> Hm, it's very strange. What is the amount of your data? Your Linux kernel
> version? Java version?
>
> PS: I can suggest switching disk_access_mode to standard in your case.
> PS PS: Also upgrade your Linux to the latest version, and Java HotSpot to
> 1.6.0_32 (from the Oracle site).
>
> 2012/6/13 Gurpreet Singh <gurpreet.singh@gmail.com>:
> > Alright, here it goes again...
> > Even with mmap_index_only, once the RES memory hit 15 gigs, the read
> > latency went berserk. This happens in 12 hours if disk_access_mode is
> > mmap, about 48 hours if it's mmap_index_only.
> >
> > only reads happening, at 50 reads/second
> > row cache size: 730 mb, row cache hit ratio: 0.75
> > key cache size: 400 mb, key cache hit ratio: 0.4
> > heap size (max 8 gigs): used 6.1-6.9 gigs
> >
> > No messages about reducing cache sizes in the logs.
> >
> > stats:
> > vmstat 1: no swapping here, however high sys cpu utilization
> > iostat (looks great): avgqu-sz = 8, avg await = 7 ms, svctm = 0.6,
> > util = 15-30%
> > top: VIRT 19.8g, SHR 6.1g, RES 15g, high cpu, buffers 2 mb
> > cfstats: 70-100 ms. This number used to be 20-30 ms.
> >
> > The value of SHR keeps increasing (owing to mmap, I guess), while at the
> > same time buffers keep decreasing. buffers start as high as 50 mb and go
> > down to 2 mb.
> >
> > This is very easily reproducible for me. Every time the RES memory hits
> > about 15 gigs, the client starts getting timeouts from cassandra and the
> > sys cpu jumps a lot. All this even though my row cache hit ratio is
> > almost 0.75.
> >
> > Other than just turning off mmap completely, is there any other solution
> > or setting to avoid a cassandra restart every couple of days? Something
> > to keep the RES memory from hitting such a high number. I have been
> > constantly monitoring the RES, and was not seeing issues when RES was at
> > 14 gigs.
> > /G
> >
> > On Fri, Jun 8, 2012 at 10:02 PM, Gurpreet Singh <gurpreet.singh@gmail.com>
> > wrote:
> >>
> >> Aaron, Ruslan,
> >> I changed the disk access mode to mmap_index_only, and it has been
> >> stable ever since, well at least for the past 20 hours. Previously, in
> >> about 10-12 hours, as soon as the resident memory was full, the client
> >> would start timing out on all its reads. It looks fine for now; I am
> >> going to let it continue to see how long it lasts and if the problem
> >> comes again.
> >>
> >> Aaron,
> >> yes, I had turned swap off.
> >>
> >> The total cpu utilization was at roughly 700%. It looked like kswapd0
> >> was using just 1 cpu, but cassandra (jsvc) cpu utilization increased
> >> quite a bit. top was reporting high system cpu and low user cpu.
> >> vmstat was not showing swapping. The java max heap size is 8 gigs,
> >> while only 4 gigs was in use, so the java heap was doing great. No GC
> >> in the logs. iostat was doing ok from what I remember; I will have to
> >> reproduce the issue for the exact numbers.
> >>
> >> cfstats latency had gone very high, but that is partly due to the high
> >> cpu usage.
> >>
> >> One thing was clear: the SHR was inching higher (due to the mmap) while
> >> the buffer cache, which started at about 20-25 mb, reduced to 2 mb by
> >> the end, which probably means that the pagecache was being evicted by
> >> kswapd0. Is there a way to fix the size of the buffer cache and not let
> >> the system evict it in favour of mmap?
> >>
> >> Also, mmapping data files would basically cause not only the data
> >> (asked for) to be read into main memory, but also a bunch of extra
> >> pages (readahead), which would not be very useful, right? The same
> >> thing for the index would actually be more useful, as there would be
> >> more index entries in the readahead part, and the index files being
> >> small wouldn't cause memory pressure such that the page cache gets
> >> evicted. mmapping the data files would make sense if the data size is
> >> smaller than the RAM, or the hot data set is smaller than the RAM;
> >> otherwise just the index would probably be a better thing to mmap, no?
> >> In my case the data size is 85 gigs, while available RAM is 16 gigs
> >> (only 8 gigs after heap).
> >>
> >> /G
> >>
> >> On Fri, Jun 8, 2012 at 11:44 AM, aaron morton <aaron@thelastpickle.com>
> >> wrote:
> >>>
> >>> Ruslan,
> >>> Why did you suggest changing the disk_access_mode ?
> >>>
> >>> Gurpreet,
> >>> I would leave the disk_access_mode at the default until you have a
> >>> reason to change it.
> >>>
> >>>> > 8 core, 16 gb ram, 6 data disks raid0, no swap configured
> >>>
> >>> is swap disabled ?
> >>>
> >>>> Gradually,
> >>>> > the system cpu becomes high, almost 70%, and the client starts
> >>>> > getting continuous timeouts
> >>>
> >>> 70% of one core or 70% of all cores ?
> >>> Check the server logs: is there GC activity ?
> >>> Check nodetool cfstats to see the read latency for the cf.
> >>>
> >>> Take a look at vmstat to see if you are swapping, and look at iostat
> >>> to see if io is the problem:
> >>> http://spyced.blogspot.co.nz/2010/01/linux-performance-basics.html
> >>>
> >>> Cheers
> >>>
> >>> -----------------
> >>> Aaron Morton
> >>> Freelance Developer
> >>> @aaronmorton
> >>> http://www.thelastpickle.com
> >>>
> >>> On 8/06/2012, at 9:00 PM, Gurpreet Singh wrote:
> >>>
> >>> Thanks Ruslan.
> >>> I will try the mmap_index_only.
> >>> Is there any guideline as to when to leave it on auto and when to use
> >>> mmap_index_only?
> >>>
> >>> /G
> >>>
> >>> On Fri, Jun 8, 2012 at 1:21 AM, ruslan usifov <ruslan.usifov@gmail.com>
> >>> wrote:
> >>>>
> >>>> disk_access_mode: mmap??
> >>>>
> >>>> Set it to disk_access_mode: mmap_index_only in cassandra.yaml.
> >>>>
> >>>> 2012/6/8 Gurpreet Singh <gurpreet.singh@gmail.com>:
> >>>> > Hi,
> >>>> > I am testing cassandra 1.1 on a 1 node cluster.
> >>>> > 8 core, 16 gb ram, 6 data disks raid0, no swap configured
> >>>> >
> >>>> > cassandra 1.1.1
> >>>> > heap size: 8 gigs
> >>>> > key cache size in mb: 800 (used only 200 mb till now)
> >>>> > memtable_total_space_in_mb: 2048
> >>>> >
> >>>> > I am running a read workload, about 30 reads/second, no writes at
> >>>> > all. The system runs fine for roughly 12 hours.
> >>>> >
> >>>> > jconsole shows that my heap size has hardly touched 4 gigs.
> >>>> > top shows:
> >>>> >   SHR increasing slowly from 100 mb to 6.6 gigs in these 12 hrs
> >>>> >   RES increasing slowly from 6 gigs all the way to 15 gigs
> >>>> >   buffers at a healthy 25 mb at some point, going down to 2 mb in
> >>>> >   these 12 hrs
> >>>> >   VIRT staying at 85 gigs
> >>>> >
> >>>> > I understand that SHR goes up because of mmap, and RES goes up
> >>>> > because it includes the SHR value as well.
> >>>> >
> >>>> > After around 10-12 hrs, the cpu utilization of the system starts
> >>>> > increasing, and I notice that the kswapd0 process starts becoming
> >>>> > more active. Gradually, the system cpu becomes high, almost 70%,
> >>>> > and the client starts getting continuous timeouts. The fact that
> >>>> > the buffers went down from 20 mb to 2 mb suggests that kswapd0 is
> >>>> > probably swapping out the pagecache.
> >>>> >
> >>>> > Is there a way to avoid kswapd0 starting to do things even when
> >>>> > there is no swap configured?
> >>>> > This is very easily reproducible for me, and I would like a way out
> >>>> > of this situation. Do I need to adjust vm memory management stuff
> >>>> > like pagecache, vfs_cache_pressure, things like that?
> >>>> >
> >>>> > Just some extra information: jna is installed, mlockall is
> >>>> > successful. There is no compaction running.
> >>>> > I would appreciate any help on this.
> >>>> > Thanks
> >>>> > Gurpreet
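
For reference, the disk_access_mode values discussed above go in cassandra.yaml; the line may need to be added if it is not already present, and the file path varies by install. A rough sketch of the choices as of the 1.1 era:

    # conf/cassandra.yaml -- pick one:
    disk_access_mode: auto               # default; on a 64-bit JVM this means mmap
    # disk_access_mode: mmap             # mmap both data and index files
    # disk_access_mode: mmap_index_only  # mmap only the index files, buffered I/O for data
    # disk_access_mode: standard         # buffered I/O everywhere, no mmap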
