From: Peter Schuller
To: user@cassandra.apache.org
Subject: Re: Instability and memory problems
Date: Mon, 21 Jun 2010 09:09:30 +0200

>> (1) Is the machine swapping? (Actively swapping in/out as reported by
>> e.g. vmstat)
>
> Yes, somewhat, although swappiness is set to 0.

Ok. While I have no good suggestion to fix it other than moving away
from mmap(), given that a low swappiness didn't help, I'd say that as
long as you're swapping you're pretty screwed as far as production
systems and maintaining low latency go. That is, unless you're
definitely swapping less than what might account for the performance
issues you're having.

> It runs, but I wouldn't say excessively.

Ok.

>> (3) mmap():ed memory that is currently resident will count towards
>> RSS; if you're using mmap():ed I/O (the default), that is to be
>> expected.
>
> This is where I'm a little confused. I thought that mmap()'d IO didn't
> actually allocate memory. I thought it was just IO through a faster code
> path.

(The below refers only to mmap() as used when mapping files; mmap() in
and of itself is used for other purposes too, such as by malloc() under
some conditions. Please remember this even though I don't repeat it on
every mention.)

What mmap() does when used to map files is allocate virtual address
space, which the operating system does not need to back with physical
RAM up front (though it may need swap depending on whether the
operating system is configured to allow over-commit). The application
then proceeds to touch pages of memory in the range allocated by
mmap(), and the kernel pages data in and out using whatever algorithm
the operating system implements. Often something similar to LRU
behavior is used for page eviction, and read-ahead may be applied
during page-in.

The "faster" bit comes from the fact that for data that is already
paged into memory, your program is doing nothing but touching memory
through the normal virtual memory system. No system call is required,
no data is copied to/from user space on reads, and on writes the data
is written back only asynchronously.
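
To make the address-space-versus-resident distinction concrete, here is
a minimal sketch, assuming Linux and its /proc/self/status file. The
helper names (parse_status, memory_snapshot, observe_mapping) and the
sizes are made up for illustration; they are not from Cassandra or from
this thread. It maps a sparse file, then touches only part of it, and
takes a snapshot of VmSize and VmRSS at each step.

```python
import mmap
import tempfile


def parse_status(text):
    """Extract VmSize and VmRSS (both in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            fields[key] = int(rest.split()[0])
    return fields


def memory_snapshot():
    """Read this process's virtual size and resident set size (Linux only)."""
    with open("/proc/self/status") as f:
        return parse_status(f.read())


def observe_mapping(size=64 * 1024 * 1024, touch=8 * 1024 * 1024):
    """Map a sparse file of `size` bytes, then read the first `touch` bytes.

    Returns three snapshots: before mapping, after mapping, after touching.
    The default sizes are arbitrary illustrative values.
    """
    with tempfile.TemporaryFile() as f:
        f.truncate(size)  # extend the file without writing any data
        before = memory_snapshot()
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
            mapped = memory_snapshot()  # address space is reserved here
            for offset in range(0, touch, mmap.PAGESIZE):
                m[offset]  # reading one byte per page makes the kernel fault it in
            touched = memory_snapshot()
    return before, mapped, touched


if __name__ == "__main__":
    for label, snap in zip(("before", "mapped", "touched"), observe_mapping()):
        print(label, snap)
```

On a typical Linux system one would expect VmSize to jump by roughly
the mapped size as soon as the mapping is created, while VmRSS should
grow only as pages are touched. The exact numbers depend on the kernel
and filesystem, so treat the output as indicative rather than exact.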

A downside with mmap() (in my opinion) is that your application no
longer has control over when/what is being read from or written to
disk, since that is entirely up to the operating system. It also tends
to be more difficult to understand what is going on when a system is
under high I/O load: what the memory is in fact being used for, what
is causing disk I/O, etc.

A related problem, in the sense that the operating system gets the
control, is that the operating system does not know what you know as
an application. One specific problem in this area is how the mmap():ed
data should be balanced against the application's own memory (address
space backed by some combination of brk() and mmap(), this time not
file-backed). If the operating system makes the "wrong" decision, such
as swapping out the JVM, you've got a problem. And it is not always
trivial to fix. If someone knows how to convince Linux to de-prioritize
mmap():ed I/O, other than decreasing swappiness, I'd love to hear about
it.

Anyways: The problem in cases like these is that while mmap() does
give you a performance boost under some circumstances along some axis
of performance measurement, you also lose control - and if the
operating system doesn't happen to do what you want it to do, the OS
does not always give you appropriate tuning/control facilities.

But to be clear - no, mmap():ing, say, 1 TB of memory does not imply
that you actually need that much physical RAM available. It's just that
the memory that *is* paged into physical RAM at any given moment counts
towards the RSS of the process (on Linux).

In your case: I'm not sure what the load is on your cluster. Is it
possible the periods of poor performance are correlated with concurrent
mark/sweep phases in the CMS GC? If the JVM is getting swapped out
slowly over time, you would expect this to primarily apply to data
outside of the active working set.
Then when the mark/sweep GC finally kicks in, touching most of the JVM
heap, you (1) begin swapping, which makes the CMS process itself slow,
and (2) drastically change the set of data cached in RAM.

How much of your physical RAM is dedicated to the JVM? I forgot to say
that you probably should consider lowering it significantly (to be
continued, getting off the subway...).

> I tried switching to standard IO mode, but it was very, very slow. What I'm
> confused about here is that if mmap()'d IO actually allocates memory that
> can put pressure on other processes' memory, is there no way to bound that?
> If not, how can anybody safely use mmap()'d IO on the JVM without risking
> pushing their process's important pages out of memory.
> swappiness is already at 0.

You can use mmap() mostly because of its behavior as described above:
the operating system can dynamically choose what to keep in physical
memory and what not. But you do need the address *space* (which tends
to be a problem on 32 bit platforms and, in the case of the JVM, for
legacy reasons where you can only mmap() 2 GB at a time).

--
/ Peter Schuller
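
A note on the "2 GB at a time" remark above: one common way to live
with a per-mapping size limit is to cover a large file with several
fixed-size windows. The following is a minimal sketch of that idea in
Python, not the JVM's or Cassandra's actual code. The names
iter_windows and checksum_in_windows are hypothetical, and on a 64 bit
platform Python itself does not have the 2 GB limit, so the window size
here is purely illustrative.

```python
import mmap
import os


def iter_windows(path, window_size):
    """Yield (offset, mapping) pairs covering the file in fixed-size windows.

    `window_size` must be a multiple of mmap.ALLOCATIONGRANULARITY because
    mapping offsets have to be aligned to it. Each mapping is closed as
    soon as the caller moves on to the next window.
    """
    if window_size <= 0 or window_size % mmap.ALLOCATIONGRANULARITY != 0:
        raise ValueError("window_size must be a positive multiple of "
                         "mmap.ALLOCATIONGRANULARITY")
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < file_size:
            # The last window may be shorter than window_size.
            length = min(window_size, file_size - offset)
            with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ,
                           offset=offset) as window:
                yield offset, window
            offset += length


def checksum_in_windows(path, window_size):
    """Sum every byte of the file, mapping only one window at a time."""
    total = 0
    for _offset, window in iter_windows(path, window_size):
        # Slicing copies the window into a bytes object; fine for a sketch.
        total += sum(window[:])
    return total
```

Only one window's worth of address space is reserved at any moment,
which is the relevant property on 32 bit platforms. Keeping all windows
mapped at once is the alternative when the limit is per mapping rather
than on total address space.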