cassandra-user mailing list archives

From Ariel Weisberg <ar...@weisberg.ws>
Subject Re: JVM safepoints, mmap, and slow disks
Date Sun, 09 Oct 2016 20:31:46 GMT
Hi,

 Even with memory mapped IO the kernel is going to do read ahead. It
 seems like if the issue is reading too much from the device, it isn't
 going to help to use memory mapped files or smaller buffered reads.
 Maybe it helps by some percentage, but it's still going to read quite
 a bit extra.
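
 For what it's worth, if the extra read ahead is the problem, here is
 a minimal sketch of opting a mapping out of it. Assumptions: a Linux
 target, reached from Java via a JNI shim; the helper name is made up.

     #include <sys/mman.h>

     /* Ask the kernel not to read ahead on an existing mmap'd
      * region, so a miss faults in (roughly) one page instead of a
      * whole read-ahead window. */
     int advise_random(void *addr, size_t len)
     {
         return madvise(addr, len, MADV_RANDOM);
     }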

Ariel

On Sun, Oct 9, 2016, at 05:39 AM, Benedict Elliott Smith wrote:
> The biggest problem with pread was the issue of over-reading (reading
> 64k where 4k would suffice), which was significantly improved in 2.2
> iirc. I don't think the penalty is very significant anymore, and if
> you are experiencing time-to-safepoint issues it's very likely a
> worthwhile switch to flip.
>
> On Sunday, 9 October 2016, Graham Sanderson <graham@vast.com> wrote:
>> I was using the term “touch” loosely to hopefully mean pre-fetch,
>> though I suspect (I think Intel has been de-emphasizing it) you can
>> still do a sensible prefetch instruction in native code. Even if
>> not, you are still better off blocking in JNI code - I haven’t
>> looked at the link to see if the correct barriers are enforced by
>> the sun.misc.Unsafe method.
>>
>>  I do suspect that you’ll see up to about 5-10% sys call overhead if
>>  you hit pread.
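>>
>>  For concreteness, a sketch of that JNI "touch" idea. Assumptions:
>>  a Linux-ish target; the Java class and method names are made up.
>>
>>      #include <jni.h>
>>      #include <stdint.h>
>>
>>      /* Touch every 4K page of [addr, addr+len) while the thread
>>       * is in native code, so any page faults taken here don't
>>       * hold up time-to-safepoint. volatile keeps the loads from
>>       * being optimized away. */
>>      JNIEXPORT void JNICALL
>>      Java_MmapTouch_touchPages(JNIEnv *env, jclass cls,
>>                                jlong addr, jlong len)
>>      {
>>          volatile const char *p = (const char *)(intptr_t)addr;
>>          for (jlong off = 0; off < len; off += 4096) {
>>              (void)p[off];  /* fault the page in, or hit cache */
>>          }
>>      }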
>>
>>  > On Oct 8, 2016, at 11:02 PM, Ariel Weisberg <ariel@weisberg.ws>
>>  > wrote:
>>  >
>>  > Hi,
>>  >
>>  > This is starting to get into dev list territory.
>>  >
>>  > Interesting idea to touch every 4K page you are going to read.
>>  >
>>  > You could use this to minimize the cost.
>>  > http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
>>  >
>>  > Maybe faster than doing buffered IO. It's a lot of cache and TLB
>>  > misses without prefetching though.
>>  >
>>  > There is a system call to page the memory in, which might be
>>  > better for larger reads. Still no guarantee things stay cached
>>  > though.
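>>  >
>>  > Presumably that means madvise(2) with MADV_WILLNEED, or
>>  > readahead(2); the syscall names are my guess at what is meant
>>  > here. A minimal sketch of the former:
>>  >
>>  >     #include <sys/mman.h>
>>  >
>>  >     /* Ask the kernel to start paging [addr, addr+len) in
>>  >      * asynchronously. As noted above, nothing guarantees the
>>  >      * pages stay resident afterwards. */
>>  >     int prefault_range(void *addr, size_t len)
>>  >     {
>>  >         return madvise(addr, len, MADV_WILLNEED);
>>  >     }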
>>  >
>>  > Ariel
>>  >
>>  >
>>  > On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>>  >> I haven’t studied the read path that carefully, but there might
>>  >> be a spot at the C* level rather than JVM level where you could
>>  >> effectively do a JNI touch of the mmap region you’re going to
>>  >> need next.
>>  >>
>>  >>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <graham@vast.com>
>>  >>> wrote:
>>  >>>
>>  >>> We don’t use Azul’s Zing, but it does have the nice feature
>>  >>> that all threads don’t have to reach safepoints at the same
>>  >>> time. That said, we make heavy use of Cassandra (with off-heap
>>  >>> memtables - not directly related, but it allows us a lot more
>>  >>> GC headroom) and SOLR, where we switched to mmap because it FAR
>>  >>> outperformed pread variants - in no case have we noticed long
>>  >>> time-to-safepoint (then again our IO is lightning fast).
>>  >>>
>>  >>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <jon@jonhaddad.com>
>>  >>>> wrote:
>>  >>>>
>>  >>>> Linux automatically uses free memory as cache.  It's not swap.
>>  >>>>
>>  >>>> http://www.tldp.org/LDP/lki/lki-4.html
>>  >>>>
>>  >>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin
>>  >>>> <vladyu@winguzone.com> wrote:
>>  >>>>> Sorry, I don't catch something. What page (memory) cache can
>>  >>>>> exist if there is no swap file? Where are those pages
>>  >>>>> written/read?
>>  >>>>>
>>  >>>>>
>>  >>>>> Best regards, Vladimir Yudovin,
>>  >>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
>>  >>>>> Cassandra on Azure and SoftLayer.
>>  >>>>> Launch your cluster in minutes.*
>>  >>>>>
>>  >>>>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 *Ariel
>>  >>>>> Weisberg<ariel@weisberg.ws>* wrote ----
>>  >>>>>> Hi,
>>  >>>>>>
>>  >>>>>> Nope, I mean page cache. Linux doesn't call the cache it
>>  >>>>>> maintains using free memory a file cache. It uses free (and
>>  >>>>>> some of the time not so free!) memory to buffer writes and
>>  >>>>>> to cache recently written/read data.
>>  >>>>>>
>>  >>>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>  >>>>>>
>>  >>>>>> When Linux decides it needs free memory it can either evict
>>  >>>>>> stuff from the page cache, flush dirty pages and then evict,
>>  >>>>>> or swap anonymous memory out. When you disable swap you only
>>  >>>>>> disable the last behavior.
>>  >>>>>>
>>  >>>>>> Maybe we are talking at cross purposes? What I meant is that
>>  >>>>>> increasing the heap size to reduce GC frequency is a
>>  >>>>>> legitimate thing to do, and it does have an impact on the
>>  >>>>>> performance of the page cache even if you have swap disabled.
>>  >>>>>>
>>  >>>>>> Ariel
>>  >>>>>>
>>  >>>>>>
>>  >>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>  >>>>>>>> Page cache is data pending flush to disk and data cached
>>  >>>>>>>> from disk.
>>  >>>>>>>
>>  >>>>>>> Do you mean file cache?
>>  >>>>>>>
>>  >>>>>>>
>>  >>>>>>> Best regards, Vladimir Yudovin,
>>  >>>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
>>  >>>>>>> Cassandra on Azure and SoftLayer.
>>  >>>>>>> Launch your cluster in minutes.*
>>  >>>>>>>
>>  >>>>>>>
>>  >>>>>>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 *Ariel Weisberg
>>  >>>>>>> <ariel@weisberg.ws>* wrote ----
>>  >>>>>>>> Hi,
>>  >>>>>>>>
>>  >>>>>>>> Page cache is in use even if you disable swap. Swap is
>>  >>>>>>>> anonymous memory, and whatever else the Linux kernel
>>  >>>>>>>> supports paging out. Page cache is data pending flush to
>>  >>>>>>>> disk and data cached from disk.
>>  >>>>>>>>
>>  >>>>>>>> Given how bad the GC pauses are in C*, I think it's not
>>  >>>>>>>> the high pole in the tent until key things are off heap
>>  >>>>>>>> and C* can run with CMS and get 10 millisecond GCs all
>>  >>>>>>>> day long.
>>  >>>>>>>>
>>  >>>>>>>> You can go through tuning and hardware selection to try
>>  >>>>>>>> to get more consistent IO pauses and remove outliers, as
>>  >>>>>>>> you mention, and as a user I think this is your best bet.
>>  >>>>>>>> Generally it's either bad device or filesystem behavior
>>  >>>>>>>> if you get page faults taking more than 200 milliseconds,
>>  >>>>>>>> O(G1 gc collection).
>>  >>>>>>>>
>>  >>>>>>>> I think a JVM change to allow safepoints around memory
>>  >>>>>>>> mapped file access is really unlikely, although I agree
>>  >>>>>>>> it would be great. I think the best hack around it is to
>>  >>>>>>>> code up your memory mapped file access into JNI methods
>>  >>>>>>>> and find some way to get that to work. Right now if you
>>  >>>>>>>> want to create a safepoint a JNI method is the way to do
>>  >>>>>>>> it. The problem is that JNI methods and POJOs don't get
>>  >>>>>>>> along well.
>>  >>>>>>>>
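>>  >>>>>>>> For concreteness, a sketch of the kind of JNI method that
>>  >>>>>>>> might be meant here, with all names made up: copy from
>>  >>>>>>>> the mapped region while in native code, so a page fault
>>  >>>>>>>> taken during the memcpy doesn't delay other threads
>>  >>>>>>>> reaching a safepoint.
>>  >>>>>>>>
>>  >>>>>>>>     #include <jni.h>
>>  >>>>>>>>     #include <stdint.h>
>>  >>>>>>>>     #include <string.h>
>>  >>>>>>>>
>>  >>>>>>>>     JNIEXPORT void JNICALL
>>  >>>>>>>>     Java_MmapReader_readMapped(JNIEnv *env, jclass cls,
>>  >>>>>>>>                                jlong addr, jbyteArray dst,
>>  >>>>>>>>                                jint len)
>>  >>>>>>>>     {
>>  >>>>>>>>         jbyte buf[4096];
>>  >>>>>>>>         jint done = 0;
>>  >>>>>>>>         while (done < len) {
>>  >>>>>>>>             jint n = len - done > 4096 ? 4096 : len - done;
>>  >>>>>>>>             /* The fault, if any, happens here, while the
>>  >>>>>>>>              * thread is counted as "in native". */
>>  >>>>>>>>             memcpy(buf,
>>  >>>>>>>>                    (const char *)(intptr_t)addr + done, n);
>>  >>>>>>>>             (*env)->SetByteArrayRegion(env, dst, done, n,
>>  >>>>>>>>                                        buf);
>>  >>>>>>>>             done += n;
>>  >>>>>>>>         }
>>  >>>>>>>>     }
>>  >>>>>>>>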
>>  >>>>>>>> If you think about it, the reason non-memory mapped IO
>>  >>>>>>>> works well is that it's all JNI methods, so they don't
>>  >>>>>>>> impact time to safepoint. I think there is a tradeoff
>>  >>>>>>>> between tolerance for outliers and performance.
>>  >>>>>>>>
>>  >>>>>>>> I don't know the state of the non-memory mapped path and
>>  >>>>>>>> how reliable that is. If it were reliable and I couldn't
>>  >>>>>>>> tolerate the outliers I would use that. I have to ask,
>>  >>>>>>>> though: why are you not able to tolerate the outliers? If
>>  >>>>>>>> you are reading and writing at quorum, how is this
>>  >>>>>>>> impacting you?
>>  >>>>>>>>
>>  >>>>>>>> Regards,
>>  >>>>>>>> Ariel
>>  >>>>>>>>
>>  >>>>>>>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>>  >>>>>>>>> Hi Josh,
>>  >>>>>>>>>
>>  >>>>>>>>>> Running with increased heap size would reduce GC
>>  >>>>>>>>>> frequency, at the cost of page cache.
>>  >>>>>>>>>
>>  >>>>>>>>> Actually, it's recommended to run C* without virtual
>>  >>>>>>>>> memory (swap) enabled, so if there is not enough memory
>>  >>>>>>>>> the JVM fails instead of blocking.
>>  >>>>>>>>>
>>  >>>>>>>>> Best regards, Vladimir Yudovin,
>>  >>>>>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted
>>  >>>>>>>>> Cloud Cassandra on Azure and SoftLayer.
>>  >>>>>>>>> Launch your cluster in minutes.*
>>  >>>>>>>>>
>>  >>>>>>>>>
>>  >>>>>>>>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 *Josh
>>  >>>>>>>>> Snyder<josh@code406.com>* wrote ----
>>  >>>>>>>>>> Hello cassandra-users,
>>  >>>>>>>>>>
>>  >>>>>>>>>> I'm investigating an issue with JVMs taking a while to
>>  >>>>>>>>>> reach a safepoint. I'd like the list's input on
>>  >>>>>>>>>> confirming my hypothesis and finding mitigations.
>>  >>>>>>>>>>
>>  >>>>>>>>>> My hypothesis is that slow block devices are causing
>>  >>>>>>>>>> Cassandra's JVM to pause completely while attempting to
>>  >>>>>>>>>> reach a safepoint.
>>  >>>>>>>>>>
>>  >>>>>>>>>> Background:
>>  >>>>>>>>>>
>>  >>>>>>>>>> Hotspot occasionally performs maintenance tasks that
>>  >>>>>>>>>> necessitate stopping all of its threads. Threads running
>>  >>>>>>>>>> JITed code occasionally read from a given safepoint
>>  >>>>>>>>>> page. If Hotspot has initiated a safepoint, reading from
>>  >>>>>>>>>> that page essentially catapults the thread into
>>  >>>>>>>>>> purgatory until the safepoint completes (the mechanism
>>  >>>>>>>>>> behind this is pretty cool). Threads performing syscalls
>>  >>>>>>>>>> or executing native code do this check upon their return
>>  >>>>>>>>>> into the JVM.
>>  >>>>>>>>>>
>>  >>>>>>>>>> In this way, during the safepoint Hotspot can be sure
>>  >>>>>>>>>> that all of its threads are either patiently waiting for
>>  >>>>>>>>>> safepoint completion or in a system call.
>>  >>>>>>>>>>
>>  >>>>>>>>>> Cassandra makes heavy use of mmapped reads in normal
>>  >>>>>>>>>> operation. When doing mmapped reads, the JVM executes
>>  >>>>>>>>>> userspace code to effect a read from a file. On the fast
>>  >>>>>>>>>> path (when the page needed is already mapped into the
>>  >>>>>>>>>> process), this instruction is very fast. When the page
>>  >>>>>>>>>> is not cached, the CPU triggers a page fault and asks
>>  >>>>>>>>>> the OS to go fetch the page. The JVM doesn't even
>>  >>>>>>>>>> realize that anything interesting is happening: to it,
>>  >>>>>>>>>> the thread is just executing a mov instruction that
>>  >>>>>>>>>> happens to take a while.
>>  >>>>>>>>>>
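>>  >>>>>>>>>> A minimal C illustration of that transparency, assuming
>>  >>>>>>>>>> Linux (error handling omitted):
>>  >>>>>>>>>>
>>  >>>>>>>>>>     #include <fcntl.h>
>>  >>>>>>>>>>     #include <stdio.h>
>>  >>>>>>>>>>     #include <sys/mman.h>
>>  >>>>>>>>>>     #include <sys/stat.h>
>>  >>>>>>>>>>
>>  >>>>>>>>>>     int main(int argc, char **argv)
>>  >>>>>>>>>>     {
>>  >>>>>>>>>>         if (argc != 2) return 1;
>>  >>>>>>>>>>         int fd = open(argv[1], O_RDONLY);
>>  >>>>>>>>>>         struct stat st;
>>  >>>>>>>>>>         fstat(fd, &st);
>>  >>>>>>>>>>         const char *p = mmap(NULL, st.st_size,
>>  >>>>>>>>>>                              PROT_READ, MAP_SHARED,
>>  >>>>>>>>>>                              fd, 0);
>>  >>>>>>>>>>         /* Just a load. Resident page: nanoseconds.
>>  >>>>>>>>>>          * Not resident: the thread sits in D state
>>  >>>>>>>>>>          * until the kernel finishes the disk I/O. */
>>  >>>>>>>>>>         printf("first byte: %d\n", p[0]);
>>  >>>>>>>>>>         return 0;
>>  >>>>>>>>>>     }
>>  >>>>>>>>>>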
>>  >>>>>>>>>> The OS, meanwhile, puts the thread in question in the D
>>  >>>>>>>>>> state (assuming Linux, here) and goes off to find the
>>  >>>>>>>>>> desired page. This may take microseconds, this may take
>>  >>>>>>>>>> milliseconds, or it may take seconds (or longer). When
>>  >>>>>>>>>> I/O occurs while the JVM is trying to enter a safepoint,
>>  >>>>>>>>>> every thread has to wait for the laggard I/O to
>>  >>>>>>>>>> complete.
>>  >>>>>>>>>>
>>  >>>>>>>>>> If you log safepoints with the right options [1], you
>>  >>>>>>>>>> can see these occurrences in the JVM output:
>>  >>>>>>>>>>
>>  >>>>>>>>>>> # SafepointSynchronize::begin: Timeout detected:
>>  >>>>>>>>>>> # SafepointSynchronize::begin: Timed out while spinning
>>  >>>>>>>>>>> # to reach a safepoint.
>>  >>>>>>>>>>> # SafepointSynchronize::begin: Threads which did not
>>  >>>>>>>>>>> # reach the safepoint:
>>  >>>>>>>>>>> # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0
>>  >>>>>>>>>>> # tid=0x00007f8785bb1f30 nid=0x4e14 runnable
>>  >>>>>>>>>>> # [0x0000000000000000]
>>  >>>>>>>>>>>   java.lang.Thread.State: RUNNABLE
>>  >>>>>>>>>>>
>>  >>>>>>>>>>> # SafepointSynchronize::begin: (End of list)
>>  >>>>>>>>>>>   vmop  [threads: total initially_running wait_to_block]
>>  >>>>>>>>>>>         [time: spin block sync cleanup vmop]
>>  >>>>>>>>>>>         page_trap_count
>>  >>>>>>>>>>> 58099.941: G1IncCollectionPause  [447 1 1]
>>  >>>>>>>>>>>         [3304 0 3305 1 190]  1
>>  >>>>>>>>>>
>>  >>>>>>>>>> If that safepoint happens to be a garbage collection
>>  >>>>>>>>>> (which this one was), you can also see it in GC logs:
>>  >>>>>>>>>>
>>  >>>>>>>>>>> 2016-10-07T13:19:50.029+0000: 58103.440: Total time
>>  >>>>>>>>>>> for which application threads were stopped: 3.4971808
>>  >>>>>>>>>>> seconds, Stopping threads took: 3.3050644 seconds
>>  >>>>>>>>>>
>>  >>>>>>>>>> In this way, JVM safepoints become a powerful weapon
>>  >>>>>>>>>> for transmuting a single thread's slow I/O into the
>>  >>>>>>>>>> entire JVM's lockup.
>>  >>>>>>>>>>
>>  >>>>>>>>>> Does all of the above sound correct?
>>  >>>>>>>>>>
>>  >>>>>>>>>> Mitigations:
>>  >>>>>>>>>>
>>  >>>>>>>>>> 1) don't tolerate block devices that are slow
>>  >>>>>>>>>>
>>  >>>>>>>>>> This is easy in theory, and only somewhat difficult in
>>  >>>>>>>>>> practice. Tools like perf and iosnoop [2] can do pretty
>>  >>>>>>>>>> good jobs of letting you know when a block device is
>>  >>>>>>>>>> slow.
>>  >>>>>>>>>>
>>  >>>>>>>>>> It is sad, though, because this makes running Cassandra
>>  >>>>>>>>>> on mixed hardware (e.g. fast SSD and slow disks in a
>>  >>>>>>>>>> JBOD) quite unappetizing.
>>  >>>>>>>>>>
>>  >>>>>>>>>> 2) have fewer safepoints
>>  >>>>>>>>>>
>>  >>>>>>>>>> Two of the biggest sources of safepoints are garbage
>>  >>>>>>>>>> collection and revocation of biased locks. Evidence
>>  >>>>>>>>>> points toward biased locking being unhelpful for
>>  >>>>>>>>>> Cassandra's purposes, so turning it off
>>  >>>>>>>>>> (-XX:-UseBiasedLocking) is a quick way to eliminate one
>>  >>>>>>>>>> source of safepoints.
>>  >>>>>>>>>>
>>  >>>>>>>>>> Garbage collection, on the other hand, is unavoidable.
>>  >>>>>>>>>> Running with increased heap size would reduce GC
>>  >>>>>>>>>> frequency, at the cost of page cache. But sacrificing
>>  >>>>>>>>>> page cache would increase page fault frequency, which
>>  >>>>>>>>>> is another thing we're trying to avoid! I don't view
>>  >>>>>>>>>> this as a serious option.
>>  >>>>>>>>>>
>>  >>>>>>>>>> 3) use a different IO strategy
>>  >>>>>>>>>>
>>  >>>>>>>>>> Looking at the Cassandra source code, there appears to
>>  >>>>>>>>>> be an un(der)documented configuration parameter called
>>  >>>>>>>>>> disk_access_mode. It appears that changing this to
>>  >>>>>>>>>> 'standard' would switch to using pread() and pwrite()
>>  >>>>>>>>>> for I/O, instead of mmap. I imagine there would be a
>>  >>>>>>>>>> throughput penalty here for the case when pages are in
>>  >>>>>>>>>> the disk cache.
>>  >>>>>>>>>>
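>>  >>>>>>>>>> If I were to try it, my understanding is that the
>>  >>>>>>>>>> switch would be a single line in cassandra.yaml (hedged,
>>  >>>>>>>>>> since the parameter is undocumented):
>>  >>>>>>>>>>
>>  >>>>>>>>>>     disk_access_mode: standard
>>  >>>>>>>>>>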
>>  >>>>>>>>>> Is this a serious option? It seems far too
>>  >>>>>>>>>> underdocumented to be thought of as a contender.
>>  >>>>>>>>>>
>>  >>>>>>>>>> 4) modify the JVM
>>  >>>>>>>>>>
>>  >>>>>>>>>> This is a longer term option. For the purposes of
>>  >>>>>>>>>> safepoints, perhaps the JVM could treat reads from an
>>  >>>>>>>>>> mmapped file in the same way it treats threads that are
>>  >>>>>>>>>> running JNI code. That is, the safepoint will proceed
>>  >>>>>>>>>> even though the reading thread has not "joined in".
>>  >>>>>>>>>> Upon finishing its mmapped read, the reading thread
>>  >>>>>>>>>> would test the safepoint page (check whether a safepoint
>>  >>>>>>>>>> is in progress, in other words).
>>  >>>>>>>>>>
>>  >>>>>>>>>> Conclusion:
>>  >>>>>>>>>>
>>  >>>>>>>>>> I don't imagine there's an easy solution here. I plan
>>  >>>>>>>>>> to go ahead with mitigation #1: "don't tolerate block
>>  >>>>>>>>>> devices that are slow", but I'd appreciate any approach
>>  >>>>>>>>>> that doesn't require my hardware to be flawless all the
>>  >>>>>>>>>> time.
>>  >>>>>>>>>>
>>  >>>>>>>>>> Josh
>>  >>>>>>>>>>
>>  >>>>>>>>>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100
>>  >>>>>>>>>> -XX:+PrintSafepointStatistics
>>  >>>>>>>>>> -XX:PrintSafepointStatisticsCount=1
>>  >>>>>>>>>> [2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop
