cassandra-user mailing list archives

From Vladimir Yudovin <vla...@winguzone.com>
Subject Re: JVM safepoints, mmap, and slow disks
Date Sat, 08 Oct 2016 18:12:48 GMT
Sorry, I'm missing something. What page (memory) cache can exist if there is no swap file?
Where are those pages written to and read from?

Best regards, Vladimir Yudovin, 
Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
Launch your cluster in minutes.




---- On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg <ariel@weisberg.ws> wrote ----

Hi,

 

 Nope I mean page cache. Linux doesn't call the cache it maintains using free memory a file
cache. It uses free (and some of the time not so free!) memory to buffer writes and to cache
recently written/read data.

 

 http://www.tldp.org/LDP/lki/lki-4.html

 

 When Linux decides it needs free memory it can either evict stuff from the page cache, flush
dirty pages and then evict, or swap anonymous memory out. When you disable swap you only disable
the last behavior.
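
 A minimal sketch of how to see this split on a Linux host, assuming the usual
 /proc/meminfo field names (Cached, Dirty, SwapTotal). The helper names and the
 SAMPLE text are hypothetical and the sample values are illustrative, not
 measurements. With swap disabled, SwapTotal is 0 while Cached and Dirty are
 still populated:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields):
    """Return (page_cache_kb, dirty_kb, swap_enabled)."""
    return (
        fields.get("Cached", 0),         # file data cached from disk
        fields.get("Dirty", 0),          # data pending flush to disk
        fields.get("SwapTotal", 0) > 0,  # False when swap is disabled
    )


# Illustrative sample only; not taken from a real machine.
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:          512000 kB
Cached:          9000000 kB
SwapTotal:             0 kB
Dirty:             12000 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on non-Linux systems
    cached, dirty, swap_on = summarize(parse_meminfo(text))
    print(f"page cache: {cached} kB, dirty: {dirty} kB, swap enabled: {swap_on}")
```

 Growing the JVM heap leaves less memory for the Cached figure, whether or not
 swap is enabled.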

 

 Maybe we are talking at cross purposes? What I meant is that increasing the heap size to
reduce GC frequency is a legitimate thing to do, and that it does have an impact on the
performance of the page cache even if you have swap disabled.

 

 Ariel

 

 

 On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:

 >Page cache is data pending flush to disk and data cached from disk.

 

 Do you mean file cache?

 

 

 Best regards, Vladimir Yudovin, 

 Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
Launch your cluster in minutes.

 
 

 

 ---- On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg <ariel@weisberg.ws> wrote ----

 
 Hi,

 

 Page cache is in use even if you disable swap. Swap is anonymous memory, and whatever else
the Linux kernel supports paging out. Page cache is data pending flush to disk and data cached
from disk.

 

 Given how bad the GC pauses are in C*, I think it's not the high pole in the tent, at least
not until key things are off heap and C* can run with CMS and get 10 millisecond GCs all day long.

 

 You can go through tuning and hardware selection to try to get more consistent IO pauses and
remove outliers, as you mention, and as a user I think this is your best bet. Generally it's
either bad device or filesystem behavior if you get page faults taking more than 200 milliseconds,
i.e. on the order of a G1 GC collection.

 

 I think a JVM change to allow safe points around memory mapped file access is really unlikely
although I agree it would be great. I think the best hack around it is to code up your memory
mapped file access into JNI methods and find some way to get that to work. Right now if you
want to create a safe point a JNI method is the way to do it. The problem is that JNI methods
and POJOs don't get along well.

 

 If you think about it the reason non-memory mapped IO works well is that it's all JNI methods
so they don't impact time to safe point. I think there is a tradeoff between tolerance for
outliers and performance.

 

 I don't know the state of the non-memory mapped path and how reliable that is. If it were
reliable and I couldn't tolerate the outliers I would use that. I have to ask though, why
are you not able to tolerate the outliers? If you are reading and writing at quorum how is
this impacting you?

 

 Regards,

 Ariel

 

 On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:

 Hi Josh,

 

 >Running with increased heap size would reduce GC frequency, at the cost of page cache.

 

 Actually, it's recommended to run C* without virtual memory (swap) enabled, so that if there
is not enough memory the JVM fails instead of blocking.

 

 Best regards, Vladimir Yudovin, 

 Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
Launch your cluster in minutes.
 
 

 

 ---- On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder <josh@code406.com> wrote ----

 
 Hello cassandra-users, 

 

 I'm investigating an issue with JVMs taking a while to reach a safepoint. I'd
 like the list's input on confirming my hypothesis and finding mitigations.

 

 My hypothesis is that slow block devices are causing Cassandra's JVM to pause
 completely while attempting to reach a safepoint.

 

 Background: 

 

 Hotspot occasionally performs maintenance tasks that necessitate stopping all
 of its threads. Threads running JITed code occasionally read from a given
 safepoint page. If Hotspot has initiated a safepoint, reading from that page
 essentially catapults the thread into purgatory until the safepoint completes
 (the mechanism behind this is pretty cool). Threads performing syscalls or
 executing native code do this check upon their return into the JVM.

 

 In this way, during the safepoint Hotspot can be sure that all of its threads
 are either patiently waiting for safepoint completion or in a system call.

 

 Cassandra makes heavy use of mmapped reads in normal operation. When doing
 mmapped reads, the JVM executes userspace code to effect a read from a file. On
 the fast path (when the page needed is already mapped into the process), this
 instruction is very fast. When the page is not cached, the CPU triggers a page
 fault and asks the OS to go fetch the page. The JVM doesn't even realize that
 anything interesting is happening: to it, the thread is just executing a mov
 instruction that happens to take a while.

 

 The OS, meanwhile, puts the thread in question in the D state (assuming Linux,
 here) and goes off to find the desired page. This may take microseconds, this
 may take milliseconds, or it may take seconds (or longer). When I/O occurs
 while the JVM is trying to enter a safepoint, every thread has to wait for the
 laggard I/O to complete.
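
 One way to catch such a thread in the act is to scan the per-thread stat files
 under /proc. This is a rough sketch, assuming Linux and the documented layout
 of /proc/<pid>/task/<tid>/stat (pid, then the command name in parentheses, then
 a one-letter state). The function names are hypothetical, not part of any
 Cassandra or JVM tooling:

```python
import os


def parse_stat_state(stat_line):
    """Return (comm, state) from one /proc/<pid>/task/<tid>/stat line.

    The comm field is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return comm, state


def threads_in_d_state(pid):
    """List (tid, comm) for threads of `pid` in uninterruptible sleep (D)."""
    blocked = []
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                comm, state = parse_stat_state(f.read())
        except OSError:
            continue  # the thread exited between listdir() and open()
        if state == "D":
            blocked.append((int(tid), comm))
    return blocked


if __name__ == "__main__":
    # Inspect this process as a harmless demo; pass a JVM pid in practice.
    print(threads_in_d_state(os.getpid()))
```

 Polling threads_in_d_state() for the Cassandra pid during a stall should show
 which native thread ids are blocked. The nid value that Hotspot prints for a
 thread is, on Linux, that native thread id in hex, so nid=0x4e14 would
 correspond to tid 19988.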

 

 If you log safepoints with the right options [1], you can see these occurrences
 in the JVM output:

 

 > # SafepointSynchronize::begin: Timeout detected:
 > # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
 > # SafepointSynchronize::begin: Threads which did not reach the safepoint:
 > # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000]
 > java.lang.Thread.State: RUNNABLE
 >
 > # SafepointSynchronize::begin: (End of list)
 > vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
 > 58099.941: G1IncCollectionPause [ 447 1 1 ] [ 3304 0 3305 1 190 ] 1

 

 If that safepoint happens to be a garbage collection (which this one was), you
 can also see it in GC logs:

 

 > 2016-10-07T13:19:50.029+0000: 58103.440: Total time for which application threads were stopped: 3.4971808 seconds, Stopping threads took: 3.3050644 seconds
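
 The two log excerpts above can be cross-checked mechanically. The sketch below
 parses the statistics line and the GC log line quoted in this message. It
 assumes the column order given by the header in the log (spin, block, sync,
 cleanup, vmop) and that the times are in milliseconds. The regular expressions
 and function names are hypothetical and only tested against these two lines:

```python
import re

# "<timestamp>: <vmop> [ threads... ] [ times... ] <page_trap_count>"
SAFEPOINT_RE = re.compile(
    r"(?P<ts>\d+\.\d+):\s+(?P<vmop>\S+)\s+"
    r"\[\s*(?P<threads>[\d\s]+?)\s*\]\s+"
    r"\[\s*(?P<times>[\d\s]+?)\s*\]\s+(?P<traps>\d+)"
)
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"(?P<total>[\d.]+) seconds, "
    r"Stopping threads took: (?P<stopping>[\d.]+) seconds"
)


def parse_safepoint_line(line):
    """Return a dict of the safepoint statistics, or None if no match."""
    m = SAFEPOINT_RE.search(line)
    if m is None:
        return None
    total, running, wait_to_block = (int(x) for x in m.group("threads").split())
    spin, block, sync, cleanup, vmop = (int(x) for x in m.group("times").split())
    return {
        "timestamp": float(m.group("ts")), "vmop": m.group("vmop"),
        "threads_total": total, "initially_running": running,
        "wait_to_block": wait_to_block,
        "spin_ms": spin, "block_ms": block, "sync_ms": sync,
        "cleanup_ms": cleanup, "vmop_ms": vmop,
        "page_trap_count": int(m.group("traps")),
    }


def parse_stopped_line(line):
    """Return (total_stopped_s, stopping_threads_s), or None if no match."""
    m = STOPPED_RE.search(line)
    if m is None:
        return None
    return float(m.group("total")), float(m.group("stopping"))


# The two lines quoted in this message.
SAFEPOINT_LINE = "58099.941: G1IncCollectionPause [ 447 1 1 ] [ 3304 0 3305 1 190 ] 1"
STOPPED_LINE = (
    "2016-10-07T13:19:50.029+0000: 58103.440: Total time for which application threads "
    "were stopped: 3.4971808 seconds, Stopping threads took: 3.3050644 seconds"
)

if __name__ == "__main__":
    stats = parse_safepoint_line(SAFEPOINT_LINE)
    pause_ms = stats["sync_ms"] + stats["cleanup_ms"] + stats["vmop_ms"]
    total_s, stopping_s = parse_stopped_line(STOPPED_LINE)
    print(f"{stats['vmop']}: sync {stats['sync_ms']} ms of {pause_ms} ms")
    print(f"GC log: stopping threads took {stopping_s} s of {total_s} s")
```

 For this pause, sync, cleanup and vmop add up to 3305 + 1 + 190 = 3496 ms,
 which lines up with the 3.497 seconds in the GC log. The sync phase is roughly
 95% of the pause, while the collection itself took only 190 ms.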

 

 In this way, JVM safepoints become a powerful weapon for transmuting a single
 thread's slow I/O into the entire JVM's lockup.

 

 Does all of the above sound correct? 

 

 Mitigations: 

 

 1) don't tolerate block devices that are slow 

 

 This is easy in theory, and only somewhat difficult in practice. Tools like
 perf and iosnoop [2] can do pretty good jobs of letting you know when a block
 device is slow.
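
 As a coarse complement to those tools, the kernel's own counters in
 /proc/diskstats can be sampled twice and differenced. This sketch assumes the
 documented field order (after the device name: reads completed, reads merged,
 sectors read, milliseconds spent reading, ...). The function names and sample
 snapshots are hypothetical, and the sample numbers are made up. It yields an
 average per device, so a single multi-second outlier can still hide in it;
 per-I/O tracing is still needed to catch those:

```python
def parse_diskstats(text):
    """Map device name -> (reads_completed, ms_spent_reading)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 11:
            continue  # not a diskstats line
        # parts: major, minor, name, reads, reads_merged, sectors_read, ms_reading, ...
        stats[parts[2]] = (int(parts[3]), int(parts[6]))
    return stats


def avg_read_latency_ms(before, after):
    """Average read latency per device between two parsed snapshots."""
    result = {}
    for dev, (reads1, ms1) in after.items():
        reads0, ms0 = before.get(dev, (reads1, ms1))
        if reads1 > reads0:  # skip devices with no reads in the interval
            result[dev] = (ms1 - ms0) / (reads1 - reads0)
    return result


# Made-up snapshots of one device, taken some interval apart.
BEFORE = "   8       0 sda 1000 10 80000 2000 500 5 40000 1500 0 3000 3500"
AFTER = "   8       0 sda 1100 10 88000 5000 500 5 40000 1500 0 6000 6500"

if __name__ == "__main__":
    print(avg_read_latency_ms(parse_diskstats(BEFORE), parse_diskstats(AFTER)))
```

 In practice the two snapshots would come from reading /proc/diskstats a few
 seconds apart on the Cassandra host.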

 

 It is sad, though, because this makes running Cassandra on mixed hardware (e.g.
 fast SSD and slow disks in a JBOD) quite unappetizing.

 

 2) have fewer safepoints 

 

 Two of the biggest sources of safepoints are garbage collection and revocation
 of biased locks. Evidence points toward biased locking being unhelpful for
 Cassandra's purposes, so turning it off (-XX:-UseBiasedLocking) is a quick way
 to eliminate one source of safepoints.

 

 Garbage collection, on the other hand, is unavoidable. Running with increased
 heap size would reduce GC frequency, at the cost of page cache. But sacrificing
 page cache would increase page fault frequency, which is another thing we're
 trying to avoid! I don't view this as a serious option.

 

 3) use a different IO strategy 

 

 Looking at the Cassandra source code, there appears to be an un(der)documented
 configuration parameter called disk_access_mode. It appears that changing this
 to 'standard' would switch to using pread() and pwrite() for I/O, instead of
 mmap. I imagine there would be a throughput penalty here for the case when
 pages are in the disk cache.
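
 To make the difference between the two strategies concrete, here is a small
 sketch (in Python rather than Cassandra's Java, with hypothetical function
 names) that reads the same bytes once through pread() and once through a
 memory mapping. It only contrasts the two access paths on a temporary file; it
 does not reproduce JVM safepoint behaviour or measure throughput:

```python
import mmap
import os
import tempfile


def read_with_pread(fd, offset, length):
    """Explicit read syscall: the caller is visibly in the kernel while it waits."""
    return os.pread(fd, length, offset)


def read_with_mmap(mapping, offset, length):
    """Plain memory access: a non-resident page is brought in by a page fault."""
    return mapping[offset:offset + length]


def demo():
    """Read one range both ways; return (pread_bytes, mmap_bytes, expected)."""
    payload = os.urandom(3 * mmap.PAGESIZE)
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()
        fd = tmp.fileno()
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapping:
            offset, length = mmap.PAGESIZE + 17, 64
            return (
                read_with_pread(fd, offset, length),
                read_with_mmap(mapping, offset, length),
                payload[offset:offset + length],
            )


if __name__ == "__main__":
    via_pread, via_mmap, expected = demo()
    print(via_pread == via_mmap == expected)
```

 Both paths return identical data. What differs is where the thread waits when
 the page is not cached: inside a system call for pread(), or inside an
 ordinary load instruction for the mapping.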

 

 Is this a serious option? It seems far too underdocumented to be thought of as
 a contender.

 

 4) modify the JVM 

 

 This is a longer term option. For the purposes of safepoints, perhaps the JVM
 could treat reads from an mmapped file in the same way it treats threads that
 are running JNI code. That is, the safepoint will proceed even though the
 reading thread has not "joined in". Upon finishing its mmapped read, the
 reading thread would test the safepoint page (check whether a safepoint is in
 progress, in other words).

 

 Conclusion: 

 

 I don't imagine there's an easy solution here. I plan to go ahead with
 mitigation #1: "don't tolerate block devices that are slow", but I'd appreciate
 any approach that doesn't require my hardware to be flawless all the time.

 

 Josh 

 

 [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100
 -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1

 [2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop 

 
 
 
 

 
 
 
 

 




