cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ariel Weisberg <ar...@weisberg.ws>
Subject Re: JVM safepoints, mmap, and slow disks
Date Mon, 10 Oct 2016 18:39:55 GMT
Hi,

> That StackOverflow headline is interesting. Based on my reading of
> Hotspot's
> code, it looks like sun.misc.unsafe is used under the hood to
> perform mmapped
> I/O. I need to learn more about Hotspot's implementation before I can
>      comment
> further.
A memory mapped file is "just" memory so it's accessed using a
ByteBuffer pointing to off heap memory. Works the same as if you had
mapped in some anonymous memory.

> Not sure what you mean here. Aren't there going to be cache and TLB
> misses for any I/O, whether via mmap or syscall?
>
The beauty of memory mapped files can be that if the data is already in
the page cache it's just a regular read of memory. If you touch each
page and it's all in memory it's going to be a slower operation that
blocks the CPU as it has to synchronously load each cache line. It's
possible you might be able to touch multiple pages in parallel if you
are clever.

So if the data is the page cache and you just access it regularly
(sequentially) you get all the benefits of the prefetcher. If you go and
touch every page first you will not have the latency of prefetching
hidden from you.

> There is a system call to page the memory in which might be better for
> larger reads. Still no guarantee things stay cached though.
When you fault a page the kernel has no idea how much you are going to
read. If there is a mismatch then you may end up going back and forth to
the device several times and for spinning disk this is worse. If you
express up front what you want to read either by fadvise/madvise or a
buffered read it can do something "smart". Granted IO scheduling ranges
from middling to non-existent most of the time, and the fadvise/madvise
stuff for this has holes I can't recall right now.

Ariel

On Mon, Oct 10, 2016, at 02:19 PM, Josh Snyder wrote:
> On Sat, Oct 8, 2016 at 9:02 PM, Ariel Weisberg
> <ariel@weisberg.ws> wrote:
> ...
>
> > You could use this to minimize the cost.
> > http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
>
> That StackOverflow headline is interesting. Based on my reading of
> Hotspot's
> code, it looks like sun.misc.unsafe is used under the hood to perform
> mmapped
> I/O. I need to learn more about Hotspot's implementation before I can
> comment
> further.
>
> > Maybe faster than doing buffered IO. It's a lot of cache and TLB
> > misses
> > with out prefetching though.
>
> Not sure what you mean here. Aren't there going to be cache and TLB
> misses for
> any I/O, whether via mmap or syscall?
>
> > There is a system call to page the memory in which might be
> > better for
> > larger reads. Still no guarantee things stay cached though.
>
> The approaches I've seen just involve something in userspace going
> through and
> touching every desired page. It works, especially if you touch
> pages in
> parallel.
>
> Thanks for the pointers. If I get anywhere with them, I'll be sure to
> let you know.
>
> Josh
>
> > On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
> >> I haven’t studied the read path that carefully, but there might be
> >> a spot at the C* level rather than JVM level where you could
> >> effectively do a JNI touch of the mmap region you’re going to need
> >> next.
> >>
> >>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <graham@vast.com>
> >>> wrote:
> >>>
> >>> We don’t use Azul’s Zing, but it does have the nice feature that
> >>> all threads don’t have to reach safepoints at the same time. That
> >>> said we make heavy use of Cassandra (with off heap memtables - not
> >>> directly related but allows us a lot more GC headroom) and SOLR
> >>> where we switched to mmap because it FAR out performed pread
> >>> variants - in no cases have we noticed long time to safe point
> >>> (then again our IO is lightning fast).
> >>>
> >>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <jon@jonhaddad.com>
> >>>> wrote:
> >>>>
> >>>> Linux automatically uses free memory as cache.  It's not swap.
> >>>>
> >>>> http://www.tldp.org/LDP/lki/lki-4.html
> >>>>
> >>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin
> >>>> <vladyu@winguzone.com> wrote:
> >>>>> __
> >>>>> Sorry, I don't catch something. What page (memory) cache can
> >>>>> exist if there is no swap file.
> >>>>> Where are those page written/read?
> >>>>>
> >>>>>
> >>>>> Best regards, Vladimir Yudovin,
> >>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
> >>>>> Cassandra on Azure and SoftLayer.
> >>>>> Launch your cluster in minutes.
> > *
> >>>>>
> >>>>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 *Ariel
> >>>>> Weisberg<ariel@weisberg.ws>* wrote ----
> >>>>>> Hi,
> >>>>>>
> >>>>>> Nope I mean page cache. Linux doesn't call the cache it
> >>>>>> maintains using free memory a file cache. It uses free (and
> >>>>>> some of the time not so free!) memory to buffer writes and to
> >>>>>> cache recently written/read data.
> >>>>>>
> >>>>>> http://www.tldp.org/LDP/lki/lki-4.html
> >>>>>>
> >>>>>> When Linux decides it needs free memory it can either evict
> >>>>>> stuff from the page cache, flush dirty pages and then evict,
or
> >>>>>> swap anonymous memory out. When you disable swap you only
> >>>>>> disable the last behavior.
> >>>>>>
> >>>>>> Maybe we are talking at cross purposes? What I meant is that
> >>>>>> increasing the heap size to reduce GC frequency is a legitimate
> >>>>>> thing to do and it does have an impact on the performance of
> >>>>>> the page cache even if you have swap disabled?
> >>>>>>
> >>>>>> Ariel
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
> >>>>>>> >Page cache is data pending flush to disk and data cached
from
> >>>>>>> >disk.
> >>>>>>>
> >>>>>>> Do you mean file cache?
> >>>>>>>
> >>>>>>>
> >>>>>>> Best regards, Vladimir Yudovin,
> >>>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
> >>>>>>> Cassandra on Azure and SoftLayer.
> >>>>>>> Launch your cluster in minutes.*
> >>>>>>>
> >>>>>>>
> >>>>>>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 *Ariel Weisberg
> >>>>>>> <ariel@weisberg.ws>* wrote ----
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Page cache is in use even if you disable swap. Swap
is
> >>>>>>>> anonymous memory, and whatever else the Linux kernel
supports
> >>>>>>>> paging out. Page cache is data pending flush to disk
and data
> >>>>>>>> cached from disk.
> >>>>>>>>
> >>>>>>>> Given how bad the GC pauses are in C* I think it's not
the
> >>>>>>>> high pole in the tent. Until key things are off heap
and C*
> >>>>>>>> can run with CMS and get 10 millisecond GCs all day
long.
> >>>>>>>>
> >>>>>>>> You can go through tuning and hardware selection try
to get
> >>>>>>>> more consistent IO pauses and remove outliers as you
mention
> >>>>>>>> and as a user I think this is your best bet. Generally
it's
> >>>>>>>> either bad device or filesystem behavior if you get
page
> >>>>>>>> faults taking more than 200 milliseconds O(G1 gc collection).
> >>>>>>>>
> >>>>>>>> I think a JVM change to allow safe points around memory
> >>>>>>>> mapped file access is really unlikely although I agree
it
> >>>>>>>> would be great. I think the best hack around it is to
code up
> >>>>>>>> your memory mapped file access into JNI methods and
find some
> >>>>>>>> way to get that to work. Right now if you want to create
a
> >>>>>>>> safe point a JNI method is the way to do it. The problem
is
> >>>>>>>> that JNI methods and POJOs don't get along well.
> >>>>>>>>
> >>>>>>>> If you think about it the reason non-memory mapped IO
works
> >>>>>>>> well is that it's all JNI methods so they don't impact
time
> >>>>>>>> to safe point. I think there is a tradeoff between tolerance
> >>>>>>>> for outliers and performance.
> >>>>>>>>
> >>>>>>>> I don't know the state of the non-memory mapped path
and how
> >>>>>>>> reliable that is. If it were reliable and I couldn't
tolerate
> >>>>>>>> the outliers I would use that. I have to ask though,
why are
> >>>>>>>> you not able to tolerate the outliers? If you are reading
and
> >>>>>>>> writing at quorum how is this impacting you?
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Ariel
> >>>>>>>>
> >>>>>>>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
> >>>>>>>>> Hi Josh,
> >>>>>>>>>
> >>>>>>>>> >Running with increased heap size would reduce
GC frequency,
> >>>>>>>>> >at the cost of page cache.
> >>>>>>>>>
> >>>>>>>>> Actually  it's recommended to run C* without virtual
memory
> >>>>>>>>> enabled. So if there  is no enough memory JVM fails
instead
> >>>>>>>>> of blocking
> >>>>>>>>>
> >>>>>>>>> Best regards, Vladimir Yudovin,
> >>>>>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted
Cloud
> >>>>>>>>> Cassandra on Azure and SoftLayer.
> >>>>>>>>> Launch your cluster in minutes.*
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 *Josh
> >>>>>>>>> Snyder<josh@code406.com>* wrote ----
> >>>>>>>>>> Hello cassandra-users,
> >>>>>>>>>>
> >>>>>>>>>> I'm investigating an issue with JVMs taking
a while to
> >>>>>>>>>> reach a safepoint.  I'd
> >>>>>>>>>> like the list's input on confirming my hypothesis
and
> >>>>>>>>>> finding mitigations.
> >>>>>>>>>>
> >>>>>>>>>> My hypothesis is that slow block devices are
causing
> >>>>>>>>>> Cassandra's JVM to pause
> >>>>>>>>>> completely while attempting to reach a safepoint.
> >>>>>>>>>>
> >>>>>>>>>> Background:
> >>>>>>>>>>
> >>>>>>>>>> Hotspot occasionally performs maintenance tasks
that
> >>>>>>>>>> necessitate stopping all
> >>>>>>>>>> of its threads. Threads running JITed code occasionally
> >>>>>>>>>> read from a given
> >>>>>>>>>> safepoint page. If Hotspot has initiated a safepoint,
> >>>>>>>>>> reading from that page
> >>>>>>>>>> essentially catapults the thread into purgatory
until the
> >>>>>>>>>> safepoint completes
> >>>>>>>>>> (the mechanism behind this is pretty cool).
Threads
> >>>>>>>>>> performing syscalls or
> >>>>>>>>>> executing native code do this check upon their
return into
> >>>>>>>>>> the JVM.
> >>>>>>>>>>
> >>>>>>>>>> In this way, during the safepoint Hotspot can
be sure that
> >>>>>>>>>> all of its threads
> >>>>>>>>>> are either patiently waiting for safepoint completion
or in
> >>>>>>>>>> a system call.
> >>>>>>>>>>
> >>>>>>>>>> Cassandra makes heavy use of mmapped reads in
normal
> >>>>>>>>>> operation. When doing
> >>>>>>>>>> mmapped reads, the JVM executes userspace code
to effect a
> >>>>>>>>>> read from a file. On
> >>>>>>>>>> the fast path (when the page needed is already
mapped into
> >>>>>>>>>> the process), this
> >>>>>>>>>> instruction is very fast. When the page is not
cached, the
> >>>>>>>>>> CPU triggers a page
> >>>>>>>>>> fault and asks the OS to go fetch the page.
The JVM doesn't
> >>>>>>>>>> even realize that
> >>>>>>>>>> anything interesting is happening: to it, the
thread is
> >>>>>>>>>> just executing a mov
> >>>>>>>>>> instruction that happens to take a while.
> >>>>>>>>>>
> >>>>>>>>>> The OS, meanwhile, puts the thread in question
in the D
> >>>>>>>>>> state (assuming Linux,
> >>>>>>>>>> here) and goes off to find the desired page.
This may take
> >>>>>>>>>> microseconds, this
> >>>>>>>>>> may take milliseconds, or it may take seconds
(or longer).
> >>>>>>>>>> When I/O occurs
> >>>>>>>>>> while the JVM is trying to enter a safepoint,
every thread
> >>>>>>>>>> has to wait for the
> >>>>>>>>>> laggard I/O to complete.
> >>>>>>>>>>
> >>>>>>>>>> If you log safepoints with the right options
[1], you can
> >>>>>>>>>> see these occurrences
> >>>>>>>>>> in the JVM output:
> >>>>>>>>>>
> >>>>>>>>>> > # SafepointSynchronize::begin: Timeout
detected:
> >>>>>>>>>> > # SafepointSynchronize::begin: Timed out
while spinning
> >>>>>>>>>> > # to reach a safepoint.
> >>>>>>>>>> > # SafepointSynchronize::begin: Threads
which did not
> >>>>>>>>>> > # reach the safepoint:
> >>>>>>>>>> > # "SharedPool-Worker-5" #468 daemon prio=5
os_prio=0
> >>>>>>>>>> > # tid=0x00007f8785bb1f30 nid=0x4e14 runnable
> >>>>>>>>>> > # [0x0000000000000000]
> >>>>>>>>>> >    java.lang.Thread.State: RUNNABLE
> >>>>>>>>>> >
> >>>>>>>>>> > # SafepointSynchronize::begin: (End of
list)
> >>>>>>>>>> >          vmop                    [threads:
total
> >>>>>>>>>> >          initially_running wait_to_block]
   [time: spin
> >>>>>>>>>> >          block sync cleanup vmop] page_trap_count
> >>>>>>>>>> > 58099.941: G1IncCollectionPause       
     [     447
> >>>>>>>>>> > 1              1    ]      [  3304    
0  3305     1
> >>>>>>>>>> > 190    ]  1
> >>>>>>>>>>
> >>>>>>>>>> If that safepoint happens to be a garbage collection
(which
> >>>>>>>>>> this one was), you
> >>>>>>>>>> can also see it in GC logs:
> >>>>>>>>>>
> >>>>>>>>>> > 2016-10-07T13:19:50.029+0000: 58103.440:
Total time for
> >>>>>>>>>> > which application threads were stopped:
3.4971808
> >>>>>>>>>> > seconds, Stopping threads took: 3.3050644
seconds
> >>>>>>>>>>
> >>>>>>>>>> In this way, JVM safepoints become a powerful
weapon for
> >>>>>>>>>> transmuting a single
> >>>>>>>>>> thread's slow I/O into the entire JVM's lockup.
> >>>>>>>>>>
> >>>>>>>>>> Does all of the above sound correct?
> >>>>>>>>>>
> >>>>>>>>>> Mitigations:
> >>>>>>>>>>
> >>>>>>>>>> 1) don't tolerate block devices that are slow
> >>>>>>>>>>
> >>>>>>>>>> This is easy in theory, and only somewhat difficult
in
> >>>>>>>>>> practice. Tools like
> >>>>>>>>>> perf and iosnoop [2] can do pretty good jobs
of letting you
> >>>>>>>>>> know when a block
> >>>>>>>>>> device is slow.
> >>>>>>>>>>
> >>>>>>>>>> It is sad, though, because this makes running
Cassandra on
> >>>>>>>>>> mixed hardware (e.g.
> >>>>>>>>>> fast SSD and slow disks in a JBOD) quite unappetizing.
> >>>>>>>>>>
> >>>>>>>>>> 2) have fewer safepoints
> >>>>>>>>>>
> >>>>>>>>>> Two of the biggest sources of safepoints are
garbage
> >>>>>>>>>> collection and revocation
> >>>>>>>>>> of biased locks. Evidence points toward biased
locking
> >>>>>>>>>> being unhelpful for
> >>>>>>>>>> Cassandra's purposes, so turning it off (-XX:-
> >>>>>>>>>> UseBiasedLocking) is a quick way
> >>>>>>>>>> to eliminate one source of safepoints.
> >>>>>>>>>>
> >>>>>>>>>> Garbage collection, on the other hand, is unavoidable.
> >>>>>>>>>> Running with increased
> >>>>>>>>>> heap size would reduce GC frequency, at the
cost of page
> >>>>>>>>>> cache. But sacrificing
> >>>>>>>>>> page cache would increase page fault frequency,
which is
> >>>>>>>>>> another thing we're
> >>>>>>>>>> trying to avoid! I don't view this as a serious
option.
> >>>>>>>>>>
> >>>>>>>>>> 3) use a different IO strategy
> >>>>>>>>>>
> >>>>>>>>>> Looking at the Cassandra source code, there
appears to be
> >>>>>>>>>> an un(der)documented
> >>>>>>>>>> configuration parameter called disk_access_mode.
It appears
> >>>>>>>>>> that changing this
> >>>>>>>>>> to 'standard' would switch to using pread()
and pwrite()
> >>>>>>>>>> for I/O, instead of
> >>>>>>>>>> mmap. I imagine there would be a throughput
penalty here
> >>>>>>>>>> for the case when
> >>>>>>>>>> pages are in the disk cache.
> >>>>>>>>>>
> >>>>>>>>>> Is this a serious option? It seems far too underdocumented
> >>>>>>>>>> to be thought of as
> >>>>>>>>>> a contender.
> >>>>>>>>>>
> >>>>>>>>>> 4) modify the JVM
> >>>>>>>>>>
> >>>>>>>>>> This is a longer term option. For the purposes
of
> >>>>>>>>>> safepoints, perhaps the JVM
> >>>>>>>>>> could treat reads from an mmapped file in the
same way it
> >>>>>>>>>> treats threads that
> >>>>>>>>>> are running JNI code. That is, the safepoint
will proceed
> >>>>>>>>>> even though the
> >>>>>>>>>> reading thread has not "joined in". Upon finishing
its
> >>>>>>>>>> mmapped read, the
> >>>>>>>>>> reading thread would test the safepoint page
(check whether
> >>>>>>>>>> a safepoint is in
> >>>>>>>>>> progress, in other words).
> >>>>>>>>>>
> >>>>>>>>>> Conclusion:
> >>>>>>>>>>
> >>>>>>>>>> I don't imagine there's an easy solution here.
I plan to go
> >>>>>>>>>> ahead with
> >>>>>>>>>> mitigation #1: "don't tolerate block devices
that are
> >>>>>>>>>> slow", but I'd appreciate
> >>>>>>>>>> any approach that doesn't require my hardware
to be
> >>>>>>>>>> flawless all the time.
> >>>>>>>>>>
> >>>>>>>>>> Josh
> >>>>>>>>>>
> >>>>>>>>>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100
> >>>>>>>>>> -XX:+PrintSafepointStatistics -
> >>>>>>>>>> XX:PrintSafepointStatisticsCount=1
> >>>>>>>>>> [2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop
> >>>>>>>>
> >>>>>>
> >> Email had 1 attachment:
> >
> >
> >>  * smime.p7s
> >>   2k (application/pkcs7-signature)

Mime
View raw message