Sender: scode@scode.org
In-Reply-To: <AANLkTimf1GH72D3Ny57EsK_8cYmV-tvlQEnlJlc4LYOC@mail.gmail.com>
References: <AANLkTikrOGH8NUrcaCyl_k8A-vXKQSDUjORiqetzPo22@mail.gmail.com>
	<AANLkTilScDi6jaJgaUn4XmtcFmyEc5sDTCKY4wrs8uf7@mail.gmail.com>
	<AANLkTinyXVo8rjn5kfRRF19193DzrR1nOcyz3Lo589r3@mail.gmail.com>
	<AANLkTik752jcFXBnuHG-aFU9D2DLPtNk0f9pDJqnhJKn@mail.gmail.com>
	<AANLkTimia9_3s2UcF-iKY47x4uumyQKieT5CTZhQT-FN@mail.gmail.com>
	<AANLkTimf1GH72D3Ny57EsK_8cYmV-tvlQEnlJlc4LYOC@mail.gmail.com>
Date: Tue, 13 Jul 2010 12:15:35 +0200
Message-ID: <AANLkTilRdEVzoC7eOE5eGRjQIRFOYk00mnBbIED1n_XS@mail.gmail.com>
Subject: Re: Minimizing the impact of compaction on latency and throughput
From: Peter Schuller <peter.schuller@infidyne.com>
To: dev@cassandra.apache.org
Content-Type: text/plain; charset=UTF-8

> This looks relevant:
> http://chbits.blogspot.com/2010/06/lucene-and-fadvisemadvise.html (see
> comments for directions to code sample)

Thanks. That's helpful; I've been trying to avoid JNI in the past so
wasn't familiar with the API, and the main difficulty was likely to be
how to best expose the functionality to Java. Having someone do almost
exactly the same thing helps ;)

I'm also glad they confirmed the effect in a very similar situation.
I'm leaning towards O_DIRECT as well, because:

(1) Even if posix_fadvise() is used, on writes you'll need to fsync()
before fadvise() anyway in order to allow Linux to evict the pages (a
theoretical OS implementation might remember the advice, but Linux
doesn't, at least not until recently).

(2) posix_fadvise() feels more obscure and less portable than
O_DIRECT, the latter being well-understood and used by e.g. databases
for a long time.

(3) O_DIRECT allows more direct control over when I/O happens and to
what extent (without playing tricks or making assumptions about e.g.
read-ahead) so will probably make it easier to kill both birds with
one stone.
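
To make the ordering in (1) concrete, here is a minimal sketch in
Python (chosen only because os.posix_fadvise() is exposed there without
JNI; this is not Cassandra code). The helper name
write_without_caching is made up for illustration, and the sketch
assumes a Linux-like platform where the advice is only a hint:

```python
import os


def write_without_caching(path, chunks):
    """Write byte chunks to path, then ask the kernel to drop the cached pages.

    Hypothetical helper for illustration only. Returns the bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        total = 0
        for chunk in chunks:
            view = memoryview(chunk)
            while view:  # os.write() may write fewer bytes than requested
                n = os.write(fd, view)
                view = view[n:]
                total += n
        # Dirty pages cannot be evicted, so flush them to disk first.
        os.fsync(fd)
        # offset=0, len=0 means "the whole file". The call is advisory,
        # so a missing or failing posix_fadvise() is simply ignored here.
        if hasattr(os, "posix_fadvise"):
            try:
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass
        return total
    finally:
        os.close(fd)
```

Swapping the fsync() and fadvise() calls would leave the freshly
written pages dirty at the time of the advice, which is the case where
Linux does not evict them.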

You indicated you were skeptical about writing an I/O scheduler. While
I agree that writing a real I/O scheduler is difficult, I suspect that
if we do direct I/O a fairly simple scheme should work well. Being able
to tweak a target MB/sec rate, select a chunk size, and select the time
window over which to rate limit would, I suspect, go a long way.
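
Roughly what I have in mind, as a sketch only (Python for brevity, not
a proposal for the actual implementation). The names
WindowedRateLimiter and throttled_copy are hypothetical, and the three
knobs are the ones mentioned above: target rate, chunk size and window
length. The clock and sleep functions are injectable purely so the
behaviour can be checked without waiting:

```python
import time


class WindowedRateLimiter:
    """Cap throughput at rate_bytes_per_sec, enforced per time window."""

    def __init__(self, rate_bytes_per_sec, window_sec,
                 clock=time.monotonic, sleep=time.sleep):
        self.budget = rate_bytes_per_sec * window_sec  # bytes allowed per window
        self.window_sec = window_sec
        self.clock = clock
        self.sleep = sleep
        self.window_start = clock()
        self.used = 0

    def acquire(self, nbytes):
        """Account for nbytes of I/O, sleeping out the window if over budget."""
        now = self.clock()
        if now - self.window_start >= self.window_sec:
            # The previous window has expired; start a fresh one.
            self.window_start = now
            self.used = 0
        # A chunk larger than the whole budget is still let through once
        # per window (used == 0), otherwise it could never be written.
        if self.used > 0 and self.used + nbytes > self.budget:
            remaining = self.window_sec - (now - self.window_start)
            self.sleep(max(0.0, remaining))
            self.window_start = self.clock()
            self.used = 0
        self.used += nbytes


def throttled_copy(src, dst, limiter, chunk_size):
    """Copy src to dst in chunk_size pieces, throttled by limiter."""
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return copied
        limiter.acquire(len(chunk))
        dst.write(chunk)
        copied += len(chunk)
```

A short window gives smoother I/O at the cost of more frequent sleeps;
a long window allows bursts of up to rate * window bytes before the
limiter kicks in.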

The situation is a bit special since in this case we are talking about
one type of I/O that is run under controlled circumstances
(controlled concurrency, we know how much memory we eat in total,
etc).

I suspect there may be a problem sustaining rates during high read
loads though. We'll see.

I'll try to make time for trying this out.

-- 
/ Peter Schuller