cassandra-user mailing list archives

From Ariel Weisberg <>
Subject Re: JVM safepoints, mmap, and slow disks
Date Sun, 09 Oct 2016 04:02:05 GMT

This is starting to get into dev list territory.

Interesting idea to touch every 4K page you are going to read.

You could use this to minimize the cost.

It might be faster than doing buffered IO, though it's a lot of cache and TLB
misses without prefetching.

There is a system call to page the memory in, which might be better for
larger reads. There is still no guarantee things stay cached, though.
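
To make the page-touching idea concrete, here is a minimal Java sketch under stated assumptions: a 4096-byte page size, a mapping that starts at file position 0, and a hypothetical PageToucher class that is not Cassandra code. It reads one byte per page and also calls MappedByteBuffer.load(), which the JDK documents as a best-effort request to make the mapping resident. Note that a touch done from Java is itself a mapped read and can stall on a page fault in the same way, which is why the thread talks about doing the touch from JNI; this only shows the access pattern.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of Cassandra.
final class PageToucher {
    // Assumed page size, matching the "4K page" mentioned above.
    static final int PAGE_SIZE = 4096;

    // Written to so the JIT cannot discard the reads as dead code.
    static volatile byte sink;

    // Reads one byte from every page overlapping [offset, offset + length).
    // Assumes buffer index 0 is page-aligned (true when mapping from file position 0).
    // Returns the number of pages touched.
    static int touchPages(MappedByteBuffer buf, int offset, int length) {
        if (length <= 0) {
            return 0;
        }
        long end = Math.min((long) buf.limit(), (long) offset + length);
        int touched = 0;
        byte acc = 0;
        for (long i = offset - (offset % PAGE_SIZE); i < end; i += PAGE_SIZE) {
            acc ^= buf.get((int) i); // absolute get; may page-fault if not resident
            touched++;
        }
        sink = acc;
        return touched;
    }

    // Maps a freshly written temp file of fileBytes bytes and touches all of it.
    static int touchTempFile(int fileBytes) {
        try {
            Path p = Files.createTempFile("touch-demo", ".bin");
            try {
                Files.write(p, new byte[fileBytes]);
                try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                    buf.load(); // best-effort request to make the mapping resident
                    return touchPages(buf, 0, buf.limit());
                }
            } finally {
                p.toFile().deleteOnExit();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("pages touched: " + touchTempFile(3 * PAGE_SIZE + 1));
    }
}
```

Neither the touch nor load() pins anything, so a page can still be evicted before the real read gets to it.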


On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
> I haven’t studied the read path that carefully, but there might be a spot at the
> C* level rather than the JVM level where you could effectively do a JNI touch of
> the mmap region you’re going to need next.
>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <> wrote:
>> We don’t use Azul’s Zing, but it does have the nice feature that all threads
>> don’t have to reach safepoints at the same time. That said, we make heavy use of
>> Cassandra (with off heap memtables - not directly related but allows us a lot
>> more GC headroom) and SOLR, where we switched to mmap because it FAR outperformed
>> pread variants - in no cases have we noticed long time to safepoint (then again
>> our IO is lightning fast).
>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <> wrote:
>>> Linux automatically uses free memory as cache.  It's not swap.
>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <> wrote:
>>>> Sorry, I'm missing something. What page (memory) cache can exist if there
>>>> is no swap file?
>>>> Where are those pages written/read?
>>>> Best regards, Vladimir Yudovin,
>>>> *Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
>>>> Launch your cluster in minutes.*
>>>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg wrote ----
>>>>> Hi,
>>>>> Nope I mean page cache. Linux doesn't call the cache it maintains using free
>>>>> memory a file cache. It uses free (and some of the time not so free!) memory to
>>>>> buffer writes and to cache recently written/read data.
>>>>> When Linux decides it needs free memory it can either evict stuff from the page
>>>>> cache, flush dirty pages and then evict, or swap anonymous memory out. When you
>>>>> disable swap you only disable the last behavior.
>>>>> Maybe we are talking at cross purposes? What I meant is that increasing the heap
>>>>> size to reduce GC frequency is a legitimate thing to do and it does have an impact
>>>>> on the performance of the page cache even if you have swap disabled.
>>>>> Ariel
>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>>>>> >Page cache is data pending flush to disk and data cached from disk.
>>>>>> Do you mean file cache?
>>>>>> Best regards, Vladimir Yudovin, 
>>>>>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg wrote ----
>>>>>>> Hi,
>>>>>>> Page cache is in use even if you disable swap. Swap is anonymous memory, and
>>>>>>> whatever else the Linux kernel supports paging out. Page cache is data pending
>>>>>>> flush to disk and data cached from disk.
>>>>>>> Given how bad the GC pauses are in C*, I think it's not the high pole in the
>>>>>>> tent, at least until key things are off heap and C* can run with CMS and get
>>>>>>> 10 millisecond GCs all day long.
>>>>>>> You can go through tuning and hardware selection to try to get more consistent
>>>>>>> IO pauses and remove outliers as you mention, and as a user I think this is your
>>>>>>> best bet. Generally it's either bad device or filesystem behavior if you get
>>>>>>> page faults taking more than 200 milliseconds O(G1 gc collection).
>>>>>>> I think a JVM change to allow safepoints around memory mapped file access is
>>>>>>> really unlikely, although I agree it would be great. I think the best hack
>>>>>>> around it is to code up your memory mapped file access into JNI methods and find
>>>>>>> some way to get that to work. Right now if you want to create a safepoint a JNI
>>>>>>> method is the way to do it. The problem is that JNI methods and POJOs don't get
>>>>>>> along well.
>>>>>>> If you think about it, the reason non-memory mapped IO works well is that it's
>>>>>>> all JNI methods so they don't impact time to safepoint. I think there is a
>>>>>>> tradeoff between tolerance for outliers and performance.
>>>>>>> I don't know the state of the non-memory mapped path and how reliable that is.
>>>>>>> If it were reliable and I couldn't tolerate the outliers I would use that. I
>>>>>>> have to ask though, why are you not able to tolerate the outliers? If you are
>>>>>>> reading and writing at quorum how is this impacting you?
>>>>>>> Regards,
>>>>>>> Ariel
>>>>>>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>>>>>>>> Hi Josh,
>>>>>>>> >Running with increased heap size would reduce GC frequency, at the cost of page cache.
>>>>>>>> Actually, it's recommended to run C* without virtual memory enabled, so if
>>>>>>>> there is not enough memory the JVM fails instead of blocking.
>>>>>>>> Best regards, Vladimir Yudovin, 
>>>>>>>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder wrote ----
>>>>>>>>> Hello cassandra-users,
>>>>>>>>> I'm investigating an issue with JVMs taking a while to reach a safepoint. I'd
>>>>>>>>> like the list's input on confirming my hypothesis and finding mitigations.
>>>>>>>>> My hypothesis is that slow block devices are causing Cassandra's JVM to pause
>>>>>>>>> completely while attempting to reach a safepoint.
>>>>>>>>> Background:
>>>>>>>>> Hotspot occasionally performs maintenance tasks that necessitate stopping all
>>>>>>>>> of its threads. Threads running JITed code occasionally read from a given
>>>>>>>>> safepoint page. If Hotspot has initiated a safepoint, reading from that page
>>>>>>>>> essentially catapults the thread into purgatory until the safepoint completes
>>>>>>>>> (the mechanism behind this is pretty cool). Threads performing syscalls or
>>>>>>>>> executing native code do this check upon their return into the JVM.
>>>>>>>>> In this way, during the safepoint Hotspot can be sure that all of its threads
>>>>>>>>> are either patiently waiting for safepoint completion or in a system call.
>>>>>>>>> Cassandra makes heavy use of mmapped reads in normal operation. When doing
>>>>>>>>> mmapped reads, the JVM executes userspace code to effect a read from a file. On
>>>>>>>>> the fast path (when the page needed is already mapped into the process), this
>>>>>>>>> instruction is very fast. When the page is not cached, the CPU triggers a page
>>>>>>>>> fault and asks the OS to go fetch the page. The JVM doesn't even realize that
>>>>>>>>> anything interesting is happening: to it, the thread is just executing a mov
>>>>>>>>> instruction that happens to take a while.
>>>>>>>>> The OS, meanwhile, puts the thread in question in the D state (assuming Linux,
>>>>>>>>> here) and goes off to find the desired page. This may take microseconds, this
>>>>>>>>> may take milliseconds, or it may take seconds (or longer). When I/O occurs
>>>>>>>>> while the JVM is trying to enter a safepoint, every thread has to wait for the
>>>>>>>>> laggard I/O to complete.
>>>>>>>>> If you log safepoints with the right options [1], you can see these occurrences
>>>>>>>>> in the JVM output:
>>>>>>>>> > # SafepointSynchronize::begin: Timeout detected:
>>>>>>>>> > # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
>>>>>>>>> > # SafepointSynchronize::begin: Threads which did not reach the safepoint:
>>>>>>>>> > # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000]
>>>>>>>>> >    java.lang.Thread.State: RUNNABLE
>>>>>>>>> >
>>>>>>>>> > # SafepointSynchronize::begin: (End of list)
>>>>>>>>> >          vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
>>>>>>>>> > 58099.941: G1IncCollectionPause             [     447          1              1    ]      [  3304     0  3305     1   190    ]  1
>>>>>>>>> If that safepoint happens to be a garbage collection (which this one was), you
>>>>>>>>> can also see it in GC logs:
>>>>>>>>> > 2016-10-07T13:19:50.029+0000: 58103.440: Total time for which application threads were stopped: 3.4971808 seconds, Stopping threads took: 3.3050644
>>>>>>>>> In this way, JVM safepoints become a powerful weapon for transmuting a single
>>>>>>>>> thread's slow I/O into the entire JVM's lockup.
>>>>>>>>> Does all of the above sound correct?
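
For anyone who wants to scan their own GC logs for this pattern, the "Total time for which application threads were stopped" line quoted above can be split into the total pause and the time spent just getting threads to the safepoint. What follows is a rough sketch, not an official parser: the StoppedTimeLine class and its regex are my own and assume the line format exactly as it appears in Josh's log.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the GC log line format quoted in this thread.
final class StoppedTimeLine {
    // Assumes the wording shown in the log excerpt; other JVM versions may differ.
    private static final Pattern LINE = Pattern.compile(
            "Total time for which application threads were stopped: (\\d+\\.\\d+) seconds, "
            + "Stopping threads took: (\\d+\\.\\d+)");

    final double totalSeconds;    // whole stop-the-world pause
    final double stoppingSeconds; // portion spent waiting for threads to reach the safepoint

    private StoppedTimeLine(double totalSeconds, double stoppingSeconds) {
        this.totalSeconds = totalSeconds;
        this.stoppingSeconds = stoppingSeconds;
    }

    // Returns the parsed values, or empty if the line is not a "stopped" line.
    static Optional<StoppedTimeLine> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new StoppedTimeLine(
                Double.parseDouble(m.group(1)),
                Double.parseDouble(m.group(2))));
    }

    // Share of the pause that was time-to-safepoint rather than the VM operation itself.
    double fractionSpentStopping() {
        return stoppingSeconds / totalSeconds;
    }

    public static void main(String[] args) {
        String line = "2016-10-07T13:19:50.029+0000: 58103.440: Total time for which "
                + "application threads were stopped: 3.4971808 seconds, "
                + "Stopping threads took: 3.3050644";
        parse(line).ifPresent(s -> System.out.printf(
                "total=%.3fs stopping=%.3fs fraction=%.2f%n",
                s.totalSeconds, s.stoppingSeconds, s.fractionSpentStopping()));
    }
}
```

A pause where nearly all of the time is "Stopping threads", as in the line above (about 3.3 of 3.5 seconds), points at time-to-safepoint rather than at the collection itself.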
>>>>>>>>> Mitigations:
>>>>>>>>> 1) don't tolerate block devices that are slow
>>>>>>>>> This is easy in theory, and only somewhat difficult in practice. Tools like
>>>>>>>>> perf and iosnoop [2] can do pretty good jobs of letting you know when a block
>>>>>>>>> device is slow.
>>>>>>>>> It is sad, though, because this makes running Cassandra on mixed hardware (e.g.
>>>>>>>>> fast SSD and slow disks in a JBOD) quite unappetizing.
>>>>>>>>> 2) have fewer safepoints
>>>>>>>>> Two of the biggest sources of safepoints are garbage collection and revocation
>>>>>>>>> of biased locks. Evidence points toward biased locking being unhelpful for
>>>>>>>>> Cassandra's purposes, so turning it off (-XX:-UseBiasedLocking) is a quick way
>>>>>>>>> to eliminate one source of safepoints.
>>>>>>>>> Garbage collection, on the other hand, is unavoidable. Running with increased
>>>>>>>>> heap size would reduce GC frequency, at the cost of page cache. But sacrificing
>>>>>>>>> page cache would increase page fault frequency, which is another thing we're
>>>>>>>>> trying to avoid! I don't view this as a serious option.
>>>>>>>>> 3) use a different IO strategy
>>>>>>>>> Looking at the Cassandra source code, there appears to be an un(der)documented
>>>>>>>>> configuration parameter called disk_access_mode. It appears that changing this
>>>>>>>>> to 'standard' would switch to using pread() and pwrite() for I/O, instead of
>>>>>>>>> mmap. I imagine there would be a throughput penalty here for the case when
>>>>>>>>> pages are in the disk cache.
>>>>>>>>> Is this a serious option? It seems far too underdocumented to be thought of as
>>>>>>>>> a contender.
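
As a rough illustration of the two IO strategies being compared, here is a small Java sketch that reads the same byte range once through a memory mapping and once through a positional FileChannel.read. It is not Cassandra's read path; the ReadPaths class and its method names are hypothetical. I would expect the positional read to end up in a pread()-style system call on Linux, but that is a JDK implementation detail and an assumption on my part.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical comparison of the two access styles; not Cassandra code.
final class ReadPaths {
    // mmap style: the data is copied out of the mapping by ordinary memory access.
    // A page that is not resident is brought in by a page fault during the copy.
    static byte[] readMapped(FileChannel ch, long pos, int len) throws IOException {
        MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
        byte[] out = new byte[len];
        map.get(out);
        return out;
    }

    // "standard" style: an explicit read at an absolute file position.
    // The channel's own position is left unchanged.
    static byte[] readPositional(FileChannel ch, long pos, int len) throws IOException {
        ByteBuffer dst = ByteBuffer.allocate(len);
        while (dst.hasRemaining()) {
            int n = ch.read(dst, pos + dst.position());
            if (n < 0) {
                throw new EOFException("file ended before " + len + " bytes were read");
            }
        }
        return dst.array();
    }

    // Writes a small temp file and checks that both paths return the same bytes.
    static boolean pathsAgree() {
        try {
            Path p = Files.createTempFile("read-paths", ".bin");
            p.toFile().deleteOnExit();
            byte[] data = new byte[10_000];
            for (int i = 0; i < data.length; i++) {
                data[i] = (byte) i;
            }
            Files.write(p, data);
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                byte[] a = readMapped(ch, 5_000, 1_000);
                byte[] b = readPositional(ch, 5_000, 1_000);
                return Arrays.equals(a, b) && a[0] == (byte) 5_000;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("paths agree: " + pathsAgree());
    }
}
```

Both paths return the same bytes; the difference that matters for this thread is where the thread is when it has to wait for the disk.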
>>>>>>>>> 4) modify the JVM
>>>>>>>>> This is a longer term option. For the purposes of safepoints, perhaps the JVM
>>>>>>>>> could treat reads from an mmapped file in the same way it treats threads that
>>>>>>>>> are running JNI code. That is, the safepoint will proceed even though the
>>>>>>>>> reading thread has not "joined in". Upon finishing its mmapped read, the
>>>>>>>>> reading thread would test the safepoint page (check whether a safepoint is in
>>>>>>>>> progress, in other words).
>>>>>>>>> Conclusion:
>>>>>>>>> I don't imagine there's an easy solution here. I plan to go ahead with
>>>>>>>>> mitigation #1: "don't tolerate block devices that are slow", but I'd appreciate
>>>>>>>>> any approach that doesn't require my hardware to be flawless all the time.
>>>>>>>>> Josh
>>>>>>>>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100
>>>>>>>>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
>>>>>>>>> [2]
