From: Benedict Elliott Smith
Date: Sun, 9 Oct 2016 22:07:53 +0100
Subject: Re: JVM safepoints, mmap, and slow disks
To: user@cassandra.apache.org

Well, you seem to be assuming:

1) read ahead is done unconditionally, with an equal claim to disk resources
2) read ahead is actually enabled (tuning recommendations are that it be
disabled, or at least drastically reduced, to my knowledge)
3) read ahead happens synchronously (even if you burn some bandwidth, not
waiting the increased latency for all blocks means a faster turnaround to
the client)

Ignoring all of this, 64kb is 1/3 of the default read ahead in Linux, so
you're talking a ~50% increase, which is not an amount I would readily
dismiss.
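(For anyone checking assumption 2 on their own boxes, a minimal sketch;
Linux exposes the per-device read ahead window via sysfs, and "sda" here is
just a placeholder for whatever device your data lives on:)

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ReadAheadCheck {
        public static void main(String[] args) throws Exception {
            // Value is in KB; 0 means read ahead is effectively disabled.
            // "sda" is a placeholder - substitute your data device.
            String kb = Files.readAllLines(
                    Paths.get("/sys/block/sda/queue/read_ahead_kb")).get(0);
            System.out.println("read_ahead_kb = " + kb.trim());
        }
    }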
On Sunday, 9 October 2016, Ariel Weisberg <ariel@weisberg.ws> wrote:

> Hi,
>
> Even with memory mapped IO the kernel is going to do read ahead. It seems
> like if the issue is reading too much from the device it isn't going to
> help to use memory mapped files or smaller buffered reads. Maybe it helps
> by some percentage, but it's still going to read quite a bit extra.
>
> Ariel
>
> On Sun, Oct 9, 2016, at 05:39 AM, Benedict Elliott Smith wrote:
>
> The biggest problem with pread was the issue of over-reading (reading 64k
> where 4k would suffice), which was significantly improved in 2.2 iirc. I
> don't think the penalty is very significant anymore, and if you are
> experiencing time-to-safepoint issues it's very likely a worthwhile
> switch to flip.
>
> On Sunday, 9 October 2016, Graham Sanderson <graham@vast.com> wrote:
>
> I was using the term "touch" loosely to hopefully mean pre-fetch, though
> I suspect (I think Intel has been de-emphasizing it) you can still issue
> a sensible prefetch instruction in native code. Even if not, you are
> still better off blocking in JNI code - I haven't looked at the link to
> see if the correct barriers are enforced by the sun.misc.Unsafe method.
>
> I do suspect that you'll see up to about 5-10% syscall overhead if you
> hit pread.
>
> > On Oct 8, 2016, at 11:02 PM, Ariel Weisberg <ariel@weisberg.ws> wrote:
> >
> > Hi,
> >
> > This is starting to get into dev list territory.
> >
> > Interesting idea to touch every 4K page you are going to read.
> >
> > You could use this to minimize the cost.
> > http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
> >
> > Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
> > without prefetching though.
> >
> > There is a system call to page the memory in, which might be better for
> > larger reads. Still no guarantee things stay cached though.
> >
> > Ariel
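(For concreteness, a minimal sketch of the "touch every 4K page" idea in
plain Java, alongside MappedByteBuffer.load(), the stock API for asking the
kernel to page a mapping in. Note the caveat: these touches still
page-fault on a Java thread, so on their own they move the stall rather
than remove it - hence the JNI discussion above. Assumes a file under 2GB
for brevity:)

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class TouchPages {
        static final int PAGE = 4096; // assume 4K pages

        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
                 FileChannel ch = raf.getChannel()) {
                // Mapping the whole file; assumes size < 2GB.
                MappedByteBuffer map =
                        ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

                // Option A: touch one byte per page so the faults happen
                // here, ahead of the read that actually needs the data.
                long sum = 0;
                for (int pos = 0; pos < map.limit(); pos += PAGE) {
                    sum += map.get(pos); // keep the touch observable
                }

                // Option B: let the JDK do it; load() advises the kernel
                // and touches the pages itself.
                map.load();

                System.out.println(sum);
            }
        }
    }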
> > On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
> >> I haven't studied the read path that carefully, but there might be a
> >> spot at the C* level rather than the JVM level where you could
> >> effectively do a JNI touch of the mmap region you're going to need
> >> next.
> >>
> >>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <graham@vast.com> wrote:
> >>>
> >>> We don't use Azul's Zing, but it does have the nice feature that all
> >>> threads don't have to reach safepoints at the same time. That said,
> >>> we make heavy use of Cassandra (with off-heap memtables - not
> >>> directly related, but it allows us a lot more GC headroom) and SOLR,
> >>> where we switched to mmap because it FAR outperformed the pread
> >>> variants - in no case have we noticed a long time to safepoint (then
> >>> again our IO is lightning fast).
> >>>
> >>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <jon@jonhaddad.com> wrote:
> >>>>
> >>>> Linux automatically uses free memory as cache. It's not swap.
> >>>>
> >>>> http://www.tldp.org/LDP/lki/lki-4.html
> >>>>
> >>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <vladyu@winguzone.com> wrote:
> >>>>> Sorry, I don't catch something. What page (memory) cache can exist
> >>>>> if there is no swap file?
> >>>>> Where are those pages written/read?
> >>>>>
> >>>>> Best regards, Vladimir Yudovin,
> >>>>> Winguzone [https://winguzone.com/?from=list] - Hosted Cloud
> >>>>> Cassandra on Azure and SoftLayer.
> >>>>> Launch your cluster in minutes.
> >>>>>
> >>>>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg <ariel@weisberg.ws> wrote ----
> >>>>>> Hi,
> >>>>>>
> >>>>>> Nope, I mean page cache. Linux doesn't call the cache it maintains
> >>>>>> using free memory a file cache. It uses free (and some of the time
> >>>>>> not so free!) memory to buffer writes and to cache recently
> >>>>>> written/read data.
> >>>>>>
> >>>>>> http://www.tldp.org/LDP/lki/lki-4.html
> >>>>>>
> >>>>>> When Linux decides it needs free memory it can either evict stuff
> >>>>>> from the page cache, flush dirty pages and then evict, or swap
> >>>>>> anonymous memory out. When you disable swap you only disable the
> >>>>>> last behavior.
> >>>>>>
> >>>>>> Maybe we are talking at cross purposes? What I meant is that
> >>>>>> increasing the heap size to reduce GC frequency is a legitimate
> >>>>>> thing to do, and it does have an impact on the performance of the
> >>>>>> page cache even if you have swap disabled.
> >>>>>>
> >>>>>> Ariel
> >>>>>>
> >>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
> >>>>>>>> Page cache is data pending flush to disk and data cached from
> >>>>>>>> disk.
> >>>>>>>
> >>>>>>> Do you mean file cache?
> >>>>>>>
> >>>>>>> Best regards, Vladimir Yudovin,
> >>>>>>> Winguzone [https://winguzone.com/?from=list] - Hosted Cloud
> >>>>>>> Cassandra on Azure and SoftLayer.
> >>>>>>> Launch your cluster in minutes.
> >>>>>>>
> >>>>>>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg <ariel@weisberg.ws> wrote ----
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Page cache is in use even if you disable swap. Swap is anonymous
> >>>>>>>> memory, and whatever else the Linux kernel supports paging out.
> >>>>>>>> Page cache is data pending flush to disk and data cached from
> >>>>>>>> disk.
> >>>>>>>>
> >>>>>>>> Given how bad the GC pauses are in C*, I think it's not the high
> >>>>>>>> pole in the tent, at least until key things are off heap and C*
> >>>>>>>> can run with CMS and get 10 millisecond GCs all day long.
> >>>>>>>>
> >>>>>>>> You can go through tuning and hardware selection to try to get
> >>>>>>>> more consistent IO pauses and remove outliers, as you mention,
> >>>>>>>> and as a user I think this is your best bet. Generally it's
> >>>>>>>> either bad device or filesystem behavior if you get page faults
> >>>>>>>> taking more than 200 milliseconds, O(G1 GC collection).
> >>>>>>>>
> >>>>>>>> I think a JVM change to allow safepoints around memory mapped
> >>>>>>>> file access is really unlikely, although I agree it would be
> >>>>>>>> great. I think the best hack around it is to code up your memory
> >>>>>>>> mapped file access as JNI methods and find some way to get that
> >>>>>>>> to work. Right now if you want to create a safepoint, a JNI
> >>>>>>>> method is the way to do it. The problem is that JNI methods and
> >>>>>>>> POJOs don't get along well.
> >>>>>>>>
> >>>>>>>> If you think about it, the reason non-memory-mapped IO works
> >>>>>>>> well is that it's all JNI methods, so they don't impact time to
> >>>>>>>> safepoint. I think there is a tradeoff between tolerance for
> >>>>>>>> outliers and performance.
> >>>>>>>>
> >>>>>>>> I don't know the state of the non-memory-mapped path and how
> >>>>>>>> reliable that is. If it were reliable and I couldn't tolerate
> >>>>>>>> the outliers, I would use that. I have to ask, though: why are
> >>>>>>>> you not able to tolerate the outliers? If you are reading and
> >>>>>>>> writing at quorum, how is this impacting you?
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Ariel
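(The swap-vs-page-cache distinction above is easy to verify: Linux reports
the two independently in /proc/meminfo, so a box with swap disabled still
shows a large "Cached" figure. A tiny Linux-only sketch:)

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class MemInfo {
        public static void main(String[] args) throws Exception {
            // "Cached" is the page cache; "SwapTotal: 0 kB" means swap is
            // disabled. Disabling swap does not disable the page cache.
            for (String line : Files.readAllLines(Paths.get("/proc/meminfo"))) {
                if (line.startsWith("Cached:") || line.startsWith("SwapTotal:")) {
                    System.out.println(line);
                }
            }
        }
    }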
> >>>>>>>>
> >>>>>>>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
> >>>>>>>>> Hi Josh,
> >>>>>>>>>
> >>>>>>>>>> Running with increased heap size would reduce GC frequency, at
> >>>>>>>>>> the cost of page cache.
> >>>>>>>>>
> >>>>>>>>> Actually it's recommended to run C* without virtual memory
> >>>>>>>>> enabled, so if there is not enough memory the JVM fails instead
> >>>>>>>>> of blocking.
> >>>>>>>>>
> >>>>>>>>> Best regards, Vladimir Yudovin,
> >>>>>>>>> Winguzone [https://winguzone.com/?from=list] - Hosted Cloud
> >>>>>>>>> Cassandra on Azure and SoftLayer.
> >>>>>>>>> Launch your cluster in minutes.
> >>>>>>>>>
> >>>>>>>>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder <josh@code406.com> wrote ----
> >>>>>>>>>> Hello cassandra-users,
> >>>>>>>>>>
> >>>>>>>>>> I'm investigating an issue with JVMs taking a while to reach a
> >>>>>>>>>> safepoint. I'd like the list's input on confirming my
> >>>>>>>>>> hypothesis and finding mitigations.
> >>>>>>>>>>
> >>>>>>>>>> My hypothesis is that slow block devices are causing
> >>>>>>>>>> Cassandra's JVM to pause completely while attempting to reach
> >>>>>>>>>> a safepoint.
> >>>>>>>>>>
> >>>>>>>>>> Background:
> >>>>>>>>>>
> >>>>>>>>>> Hotspot occasionally performs maintenance tasks that
> >>>>>>>>>> necessitate stopping all of its threads. Threads running JITed
> >>>>>>>>>> code occasionally read from a given safepoint page. If Hotspot
> >>>>>>>>>> has initiated a safepoint, reading from that page essentially
> >>>>>>>>>> catapults the thread into purgatory until the safepoint
> >>>>>>>>>> completes (the mechanism behind this is pretty cool). Threads
> >>>>>>>>>> performing syscalls or executing native code do this check
> >>>>>>>>>> upon their return into the JVM.
> >>>>>>>>>>
> >>>>>>>>>> In this way, during the safepoint Hotspot can be sure that all
> >>>>>>>>>> of its threads are either patiently waiting for safepoint
> >>>>>>>>>> completion or in a system call.
> >>>>>>>>>>
> >>>>>>>>>> Cassandra makes heavy use of mmapped reads in normal
> >>>>>>>>>> operation. When doing mmapped reads, the JVM executes
> >>>>>>>>>> userspace code to effect a read from a file. On the fast path
> >>>>>>>>>> (when the page needed is already mapped into the process),
> >>>>>>>>>> this instruction is very fast. When the page is not cached,
> >>>>>>>>>> the CPU triggers a page fault and asks the OS to go fetch the
> >>>>>>>>>> page. The JVM doesn't even realize that anything interesting
> >>>>>>>>>> is happening: to it, the thread is just executing a mov
> >>>>>>>>>> instruction that happens to take a while.
> >>>>>>>>>>
> >>>>>>>>>> The OS, meanwhile, puts the thread in question in the D state
> >>>>>>>>>> (assuming Linux here) and goes off to find the desired page.
> >>>>>>>>>> This may take microseconds, this may take milliseconds, or it
> >>>>>>>>>> may take seconds (or longer). When I/O occurs while the JVM is
> >>>>>>>>>> trying to enter a safepoint, every thread has to wait for the
> >>>>>>>>>> laggard I/O to complete.
> >>>>>>>>>>
> >>>>>>>>>> If you log safepoints with the right options [1], you can see
> >>>>>>>>>> these occurrences in the JVM output:
> >>>>>>>>>>
> >>>>>>>>>>> # SafepointSynchronize::begin: Timeout detected:
> >>>>>>>>>>> # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
> >>>>>>>>>>> # SafepointSynchronize::begin: Threads which did not reach the safepoint:
> >>>>>>>>>>> # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000]
> >>>>>>>>>>>    java.lang.Thread.State: RUNNABLE
> >>>>>>>>>>>
> >>>>>>>>>>> # SafepointSynchronize::begin: (End of list)
> >>>>>>>>>>>          vmop  [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
> >>>>>>>>>>> 58099.941: G1IncCollectionPause  [ 447  1  1 ]  [ 3304  0  3305  1  190 ]  1
> >>>>>>>>>>
> >>>>>>>>>> If that safepoint happens to be a garbage collection (which
> >>>>>>>>>> this one was), you can also see it in GC logs:
> >>>>>>>>>>
> >>>>>>>>>>> 2016-10-07T13:19:50.029+0000: 58103.440: Total time for which application threads were stopped: 3.4971808 seconds, Stopping threads took: 3.3050644 seconds
> >>>>>>>>>>
> >>>>>>>>>> In this way, JVM safepoints become a powerful weapon for
> >>>>>>>>>> transmuting a single thread's slow I/O into the entire JVM's
> >>>>>>>>>> lockup.
> >>>>>>>>>>
> >>>>>>>>>> Does all of the above sound correct?
> >>>>>>>>>>
> >>>>>>>>>> Mitigations:
> >>>>>>>>>>
> >>>>>>>>>> 1) don't tolerate block devices that are slow
> >>>>>>>>>>
> >>>>>>>>>> This is easy in theory, and only somewhat difficult in
> >>>>>>>>>> practice. Tools like perf and iosnoop [2] can do pretty good
> >>>>>>>>>> jobs of letting you know when a block device is slow.
> >>>>>>>>>>
> >>>>>>>>>> It is sad, though, because this makes running Cassandra on
> >>>>>>>>>> mixed hardware (e.g. fast SSD and slow disks in a JBOD) quite
> >>>>>>>>>> unappetizing.
> >>>>>>>>>>
> >>>>>>>>>> 2) have fewer safepoints
> >>>>>>>>>>
> >>>>>>>>>> Two of the biggest sources of safepoints are garbage
> >>>>>>>>>> collection and revocation of biased locks. Evidence points
> >>>>>>>>>> toward biased locking being unhelpful for Cassandra's
> >>>>>>>>>> purposes, so turning it off (-XX:-UseBiasedLocking) is a quick
> >>>>>>>>>> way to eliminate one source of safepoints.
> >>>>>>>>>>
> >>>>>>>>>> Garbage collection, on the other hand, is unavoidable. Running
> >>>>>>>>>> with an increased heap size would reduce GC frequency, at the
> >>>>>>>>>> cost of page cache. But sacrificing page cache would increase
> >>>>>>>>>> page fault frequency, which is another thing we're trying to
> >>>>>>>>>> avoid! I don't view this as a serious option.
> >>>>>>>>>>
> >>>>>>>>>> 3) use a different IO strategy
> >>>>>>>>>>
> >>>>>>>>>> Looking at the Cassandra source code, there appears to be an
> >>>>>>>>>> un(der)documented configuration parameter called
> >>>>>>>>>> disk_access_mode. It appears that changing this to 'standard'
> >>>>>>>>>> would switch to using pread() and pwrite() for I/O, instead of
> >>>>>>>>>> mmap (see the sketch below). I imagine there would be a
> >>>>>>>>>> throughput penalty here for the case when pages are in the
> >>>>>>>>>> disk cache.
> >>>>>>>>>>
> >>>>>>>>>> Is this a serious option? It seems far too underdocumented to
> >>>>>>>>>> be thought of as a contender.
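(A minimal sketch of what the two access modes boil down to at the JVM
level - not C*'s actual read path. The point is that
FileChannel.read(dst, position) is a positional read done in a native call
(pread() underneath), so a slow read parks the thread in a syscall where
the safepoint can proceed, while the mmap get() is ordinary Java code that
can stall mid-mov. Assumes a non-empty file:)

    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class AccessModes {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
                 FileChannel ch = raf.getChannel()) {

                // "mmap": a page fault inside get() is just a slow mov, and
                // the thread still counts as running Java code while the
                // JVM waits to reach a safepoint.
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY,
                                              0, Math.min(ch.size(), 4096));
                byte viaMmap = map.get(0);

                // "standard": a positional read, i.e. a native call - slow
                // IO here does not hold up time-to-safepoint the same way.
                ByteBuffer dst = ByteBuffer.allocate(4096);
                ch.read(dst, 0);
                byte viaPread = dst.get(0);

                System.out.println(viaMmap == viaPread);
            }
        }
    }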
> >>>>>>>>>>
> >>>>>>>>>> 4) modify the JVM
> >>>>>>>>>>
> >>>>>>>>>> This is a longer-term option. For the purposes of safepoints,
> >>>>>>>>>> perhaps the JVM could treat reads from an mmapped file in the
> >>>>>>>>>> same way it treats threads that are running JNI code. That is,
> >>>>>>>>>> the safepoint will proceed even though the reading thread has
> >>>>>>>>>> not "joined in". Upon finishing its mmapped read, the reading
> >>>>>>>>>> thread would test the safepoint page (check whether a
> >>>>>>>>>> safepoint is in progress, in other words).
> >>>>>>>>>>
> >>>>>>>>>> Conclusion:
> >>>>>>>>>>
> >>>>>>>>>> I don't imagine there's an easy solution here. I plan to go
> >>>>>>>>>> ahead with mitigation #1: "don't tolerate block devices that
> >>>>>>>>>> are slow", but I'd appreciate any approach that doesn't
> >>>>>>>>>> require my hardware to be flawless all the time.
> >>>>>>>>>>
> >>>>>>>>>> Josh
> >>>>>>>>>>
> >>>>>>>>>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100
> >>>>>>>>>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
> >>>>>>>>>> [2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop
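(If anyone wants to reproduce the hypothesis in isolation, a rough sketch,
not a benchmark: mmap a large cold file, stride through it while another
thread forces GC safepoints, and run with the flags from [1]. Drop the page
cache first (echo 3 > /proc/sys/vm/drop_caches) so the faults actually hit
the device; the file path is whatever big file you have handy:)

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class SafepointStall {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
                 FileChannel ch = raf.getChannel()) {
                long len = Math.min(ch.size(), Integer.MAX_VALUE);
                MappedByteBuffer map =
                        ch.map(FileChannel.MapMode.READ_ONLY, 0, len);

                // Churn thread: allocates garbage to force regular GC
                // safepoints while the main thread page-faults.
                Thread churn = new Thread(() -> {
                    while (true) { byte[] garbage = new byte[1 << 20]; }
                });
                churn.setDaemon(true);
                churn.start();

                // Each cold page is a fault inside Java code; with a slow
                // device, "Stopping threads took" should climb in the logs.
                long sum = 0;
                for (int pos = 0; pos < map.limit(); pos += 4096) {
                    sum += map.get(pos);
                }
                System.out.println(sum);
            }
        }
    }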