Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
Sender: scode@scode.org
In-Reply-To: <AANLkTimWXsURwLfcPsW5c_FMsPRuQq4-93dUzQlbhmNi@mail.gmail.com>
References: <AANLkTinLTl_waviGkolmGewmFj6kj6hs54QJTwZPfejq@mail.gmail.com>
	<AANLkTilw-yFqbV5bZeOP969fVOkR4T28lYSv1OQUAj9Z@mail.gmail.com>
	<AANLkTimWXsURwLfcPsW5c_FMsPRuQq4-93dUzQlbhmNi@mail.gmail.com>
Date: Mon, 21 Jun 2010 09:09:30 +0200
Message-ID: <AANLkTilqb7VYYpkotb3iwTkpJefM4HCHn6pJFFJO1VUl@mail.gmail.com>
Subject: Re: Instability and memory problems
From: Peter Schuller <peter.schuller@infidyne.com>
To: user@cassandra.apache.org
Content-Type: text/plain; charset=UTF-8

>> (1) Is the machine swapping? (Actively swapping in/out as reported by
>> e.g. vmstat)
>
> Yes, somewhat, although swappiness is set to 0.

Ok. While I have no good suggestion to fix it other than moving away
from mmap(), given that a low swappiness didn't help, I'd say that as
long as you're swapping you're pretty screwed as far as production
systems go and maintaining low latency. That is, unless you're
definitely swapping less than what might account for the performance
issues you're having.
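
For reference, "actively swapping" can be checked without vmstat by
sampling the cumulative pswpin/pswpout counters in /proc/vmstat
(Linux only) and looking at the difference. A minimal sketch; the
function names are made up for illustration, and the one second
interval is an arbitrary assumption:

```python
import time


def parse_swap_counters(vmstat_text):
    """Extract the cumulative pswpin/pswpout page counts from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_activity(before, after):
    """Pages swapped in and out between two samples."""
    return {key: after[key] - before[key] for key in ("pswpin", "pswpout")}


def sample(interval=1.0, path="/proc/vmstat"):
    """Take two samples `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_swap_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_swap_counters(f.read())
    return swap_activity(before, after)
```

Non-zero deltas that persist across several samples mean the machine
is actively swapping, as opposed to merely having something sitting in
swap.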

> It runs, but I wouldn't say excessively.

Ok.

>> (3) mmap():ed memory that is currently resident will count towards
>> RSS; if you're using mmap():ed I/O (the default), that is to be
>> expected.
>
> This is where I'm a little confused. I thought that mmap()'d IO didn't
> actually allocate memory. I thought it was just IO through a faster code
> path.

(The below refers only to mmap() as used when mapping files; mmap() in
and of itself is used for other purposes too, such as by malloc()
under some conditions. Please remember this even though I don't repeat
it on every mention.)

When used to map files, mmap() allocates address space in virtual
memory, which the operating system does not need to back with
physical RAM up front (though it may need swap depending on whether
the operating system is configured to allow over-commit).

The application then proceeds touching pages of memory in the range
allocated by mmap() and it is up to the kernel to page data in and out
using some algorithm that is up to the operating system. Often
something similar to LRU behavior is used with respect to page
eviction, and during page-in read-ahead may be applied.

The "faster" bit comes from the fact that for data that is already
paged into memory, your program is doing nothing but touching memory
through the normal virtual memory system. No system call is required,
reads involve no copying of data to/from user space, and writes are
flushed to disk asynchronously by the kernel.
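
To make the difference concrete, here is a small sketch that reads the
same bytes once through an explicit read call and once through a file
mapping. It is in Python rather than Java purely for brevity, the
function name is hypothetical, and os.pread() assumes a Unix-like
system:

```python
import mmap
import os
import tempfile


def read_both_ways(payload, offset, length):
    """Read the same byte range via a system call and via a file mapping."""
    with tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()  # make sure the data has reached the file before reading
        # Explicit I/O: every request is a system call, and the kernel
        # copies the data into a buffer owned by the process.
        via_syscall = os.pread(f.fileno(), length, offset)
        # Mapped I/O: once the mapping exists, access is ordinary memory
        # access; the kernel only gets involved on a page fault.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # The slice copies into a Python bytes object, but no read
            # system call is issued for it.
            via_mapping = m[offset:offset + length]
    return via_syscall, via_mapping
```

Both paths return identical data; what differs is who decides when the
disk is touched.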

A downside with mmap() (in my opinion) is that your application no
longer has control over when/what is being read from or written to
disk since it is entirely up to the operating system. It also tends to
be more difficult to understand what is going on when a system is
under high I/O load; such as what the memory is in fact being used
for, what is causing disk I/O, etc.

A related problem in the sense that the operating system gets the
control, is that the operating system does not know what you know, as
an application. One of the problems in this area is specifically - how
should the mmap():ed data be balanced with that of the application
(some combination of brk() and mmap() (this time not to file) backed
address space).

If the operating system makes the "wrong" decision, such as swapping
out the JVM, you've got a problem. And it is not always trivial to
fix. If someone knows how to convince Linux to de-prioritize mmap():ed
I/O, other than decreasing swappiness, I'd love to hear about it.

Anyways: The problem in cases like these is that while mmap() does
give you a performance boost under some circumstances along some axis
of performance measurement, you also lose control - and if the
operating system doesn't happen to do what you want it to do, the OS
does not always give you appropriate tuning/control facilities.

But to be clear - no, mmap():ing, say, 1 TB of memory does not imply
that you actually need that much physical RAM available. It's just
that the memory that *is* paged into physical RAM at any given moment
counts towards the RSS of the process (on Linux).
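
This can be observed from user space. The sketch below maps a sparse
temporary file, touches one byte in each of the first few pages, and
reports VmRSS from /proc/self/status before and after. The function
names and default sizes are assumptions for illustration, the RSS
reading is Linux only (None elsewhere), and the exact growth depends
on kernel read-ahead, so no particular numbers are promised:

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def rss_kib():
    """Resident set size of this process in KiB (Linux), or None if unavailable."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def touch_and_measure(size=256 * 1024 * 1024, touch_pages=1024):
    """Map `size` bytes of a sparse file and touch only `touch_pages` pages."""
    with tempfile.TemporaryFile() as f:
        f.truncate(size)  # sparse file: no data is actually written
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
            before = rss_kib()  # mapping exists, nothing touched yet
            total = 0
            for i in range(touch_pages):
                total += m[i * PAGE]  # one byte per page forces a page-in
            after = rss_kib()
    return before, after, total


if __name__ == "__main__":
    before, after, _ = touch_and_measure()
    print("RSS before touching (KiB):", before)
    print("RSS after touching (KiB):", after)
```

Creating the mapping by itself should leave RSS essentially unchanged;
any growth comes from the pages actually touched (plus whatever the
kernel reads ahead), not from the size of the mapping.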

In your case: I'm not sure what the load is on your cluster. Is it
possible the periods of poor performance are correlated with
concurrent mark/sweep phases in the CMS GC? If the JVM is getting
swapped out slowly over time, you would expect this to primarily apply
to data outside of the active working set. Then when the mark/sweep GC
finally kicks in, touching most of the JVM heap, you (1) begin
swapping, causing the CMS process itself to be slow, and (2)
drastically change the set of data cached in RAM.

How much of your physical RAM is dedicated to the JVM?

I forgot to say that you probably should consider lowering it
significantly (to be continued, getting off the subway...).

> I tried switching to standard IO mode, but it was very, very slow. What I'm
> confused about here is that if mmap()'d IO actually allocates memory that
> can put pressure on other processes' memory, is there no way to bound that?
> If not, how can anybody safely use mmap()'d IO on the JVM without risking
> pushing their process's important pages out of memory.
> swappiness is already at 0.

You can use mmap() mostly because of its behavior as described above:
the operating system can dynamically choose what to keep in physical
memory and what not. But you do need the address *space*, which tends
to be a problem on 32 bit platforms and, for legacy reasons, on the
JVM, where you can only mmap() 2 GB at a time.
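
As an illustration of what the 2 GB limit means in practice, a large
file has to be covered by several mappings. The sketch below only does
the arithmetic for splitting a file into (offset, length) segments. It
assumes a per-mapping limit of 2**31 - 1 bytes (Java's
Integer.MAX_VALUE, the largest size FileChannel.map() accepts); the
function name is hypothetical, and this is not a description of how
Cassandra itself segments its files:

```python
JAVA_MAX_MAPPING = 2**31 - 1  # assumed limit: Integer.MAX_VALUE bytes


def mapping_segments(file_size, max_len=JAVA_MAX_MAPPING):
    """Split a file of `file_size` bytes into (offset, length) segments."""
    segments = []
    offset = 0
    while offset < file_size:
        length = min(max_len, file_size - offset)
        segments.append((offset, length))
        offset += length
    return segments
```

A 5 GiB file, for example, needs three such mappings, and any read
that straddles a segment boundary has to be handled by the
application.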


-- 
/ Peter Schuller