From: Peter Schuller
To: dev@cassandra.apache.org
Date: Tue, 13 Jul 2010 12:15:35 +0200
Subject: Re: Minimizing the impact of compaction on latency and throughput

> This looks relevant:
> http://chbits.blogspot.com/2010/06/lucene-and-fadvisemadvise.html (see
> comments for directions to code sample)

Thanks. That's helpful; I've been trying to avoid JNI in the past, so I
wasn't familiar with the API, and the main difficulty was likely to be
how best to expose the functionality to Java. Having someone do almost
exactly the same thing helps ;) I'm also glad they confirmed the effect
in a very similar situation.

I'm also leaning towards O_DIRECT, because:

(1) Even if posix_fadvise() is used, on writes you'll need to fsync()
before fadvise() anyway in order to allow Linux to evict the pages (a
theoretical OS implementation might remember the advice call, but Linux
doesn't - at least not up until recently).

(2) posix_fadvise() feels more obscure and less portable than O_DIRECT,
the latter being well-understood and used by e.g. databases for a long
time.

(3) O_DIRECT allows more direct control over when I/O happens and to
what extent (without playing tricks or making assumptions about e.g.
read-ahead), so it will probably make it easier to kill both birds with
one stone.

You indicated you were skeptical about writing an I/O scheduler. While I
agree that writing a real I/O scheduler is difficult, I suspect that if
we do direct I/O, a fairly simple scheme should work well. Being able to
tweak a target MB/sec rate, select a chunk size, and select the time
window over which to rate limit would, I suspect, go a long way.

The situation is a bit special, since in this case we are talking about
one type of I/O that is run under controlled circumstances (controlled
concurrency, we know how much memory we eat in total, etc).

I suspect there may be a problem sustaining rates during high read loads
though. We'll see.

I'll try to make time for trying this out.

-- 
/ Peter Schuller
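
A minimal sketch of the "fairly simple scheme" described above, assuming
a fixed time window with a byte budget derived from the target MB/sec
rate. The class name, method names, and the injectable clock are all
hypothetical; none of this comes from Cassandra or from this thread. It
only decides how long to wait before each chunk; the direct I/O itself
is left to the caller.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch: caps the number of bytes issued per time window.
 * Tunables match the ones named in the mail: target MB/sec, the time
 * window, and (per call) the chunk size.
 */
final class WindowedRateLimiter {
    private final long bytesPerWindow;
    private final long windowNanos;
    private final LongSupplier nanoClock; // injectable so the logic is testable
    private long windowStart;
    private long bytesInWindow;

    WindowedRateLimiter(double targetMbPerSec, long windowMillis, LongSupplier nanoClock) {
        if (targetMbPerSec <= 0 || windowMillis <= 0) {
            throw new IllegalArgumentException("rate and window must be positive");
        }
        this.windowNanos = windowMillis * 1_000_000L;
        // Byte budget for one window at the target rate (1 MB taken as 2^20 bytes).
        this.bytesPerWindow =
            Math.max(1L, (long) (targetMbPerSec * 1024 * 1024 * windowMillis / 1000.0));
        this.nanoClock = nanoClock;
        this.windowStart = nanoClock.getAsLong();
    }

    /**
     * Charges one chunk against the budget and returns how many nanoseconds
     * the caller should wait before issuing it (0 if it fits right now).
     */
    synchronized long reserve(int chunkBytes) {
        long now = nanoClock.getAsLong();
        if (now - windowStart >= windowNanos) {
            // The current window has elapsed: start a fresh one.
            windowStart = now;
            bytesInWindow = 0;
        }
        // A chunk larger than the whole budget is still let through when the
        // window is empty, so the caller always makes progress.
        if (bytesInWindow > 0 && bytesInWindow + chunkBytes > bytesPerWindow) {
            // Budget exhausted: charge the chunk to the next window instead.
            long waitNanos = windowStart + windowNanos - now;
            windowStart += windowNanos;
            bytesInWindow = chunkBytes;
            return waitNanos;
        }
        bytesInWindow += chunkBytes;
        return 0L;
    }

    /** Blocking variant: sleeps for as long as reserve() asks. */
    void acquire(int chunkBytes) throws InterruptedException {
        long waitNanos = reserve(chunkBytes);
        if (waitNanos > 0) {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
    }
}
```

A compaction loop would construct the limiter with System::nanoTime as
the clock and call acquire(chunkSize) before each chunk-sized read or
write. A shorter window gives smoother I/O at the cost of more frequent
sleeps; whether such a scheme can sustain its rate under heavy read load
is the open question raised above.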