Date: Sat, 15 Nov 2008 15:38:34 -0800
To: lucy-dev@lucene.apache.org
Subject: Optimizing InStream for mmap
Message-ID: <20081115233834.GA523@rectangular.com>
From: Marvin Humphrey

Greets,

In commits r3895 - r3925 to the KinoSearch repository, InStream has been
optimized for internal use of mmap() on Unixen.

  * On 32-bit Unixen, InStream provides access to the file data via a
    variable-width "sliding window".
    The window is opened and closed using repeated calls to mmap() and
    munmap().
  * On systems without sys/mman.h (e.g. Windows), we fall back to using a
    malloc'd buffer and sequential reads to fake up a sliding window.
  * On 64-bit Unixen, mmap() only gets called once, at object creation time.
    There's no need for a sliding window.

For optimum performance under 64-bit Unixen, client code can request a window
the width of the entire file:

    Foo*
    Foo_new(InStream *instream)
    {
        Foo *self = (Foo*)CREATE(NULL, FOO);
        i64_t len = InStream_Length(instream);
        self->buf      = InStream_Buf(instream, len); /* map whole file */
        self->limit    = self->buf + len;             /* one past last byte */
        self->instream = REFCOUNT_INC(instream);
        return self;
    }

Such code would work fine for small files on 32-bit systems.  Large files,
however, would cause such systems to blow up, either by exceeding addressable
space and causing mmap() to fail, or, for systems without mmap(), through
excessive memory consumption.

To be portable to 32-bit systems, core modules will have to avoid mapping
large files.  If we want to max out the performance of PostingLists and
Lexicons on 64-bit systems, that means we'll have to accept the increased
maintenance burden of providing two different behaviors.  I don't think the
burden will be too heavy, though.

Marvin Humphrey
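
For readers unfamiliar with the 32-bit strategy described above, here is a
minimal sketch of a sliding window built on mmap() and munmap().  It is not
the KinoSearch InStream code: the Window struct and the Window_open,
Window_buf and Window_release functions are hypothetical names invented for
illustration, and the sketch assumes a read-only file on a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sliding-window state; not the actual InStream layout. */
typedef struct {
    int     fd;        /* open file descriptor */
    off_t   file_len;  /* total length of the file in bytes */
    char   *map;       /* start of the current mapping, or NULL */
    size_t  map_len;   /* length of the current mapping in bytes */
    off_t   map_start; /* page-aligned file offset of the mapping */
} Window;

/* Open `path` read-only and record its length.  Returns 0 on success. */
static int
Window_open(Window *win, const char *path)
{
    struct stat st;
    win->fd = open(path, O_RDONLY);
    if (win->fd < 0) { return -1; }
    if (fstat(win->fd, &st) != 0) { close(win->fd); return -1; }
    win->file_len  = st.st_size;
    win->map       = NULL;
    win->map_len   = 0;
    win->map_start = 0;
    return 0;
}

/* Close the window: unmap the current mapping, if any. */
static void
Window_release(Window *win)
{
    if (win->map != NULL) {
        munmap(win->map, win->map_len);
        win->map     = NULL;
        win->map_len = 0;
    }
}

/* Return a pointer to `amount` bytes starting at file offset `offset`.
 * If the request falls outside the current window, slide the window by
 * unmapping the old region and mapping a new one.  Returns NULL if mmap()
 * fails. */
static char*
Window_buf(Window *win, off_t offset, size_t amount)
{
    assert(amount > 0 && offset >= 0);
    assert(offset + (off_t)amount <= win->file_len);

    if (win->map == NULL
        || offset < win->map_start
        || offset + (off_t)amount > win->map_start + (off_t)win->map_len
    ) {
        /* mmap() requires a page-aligned file offset, so round down. */
        long    page_size = sysconf(_SC_PAGESIZE);
        off_t   aligned   = offset - (offset % page_size);
        size_t  len       = (size_t)(offset - aligned) + amount;
        void   *ptr;

        Window_release(win);
        ptr = mmap(NULL, len, PROT_READ, MAP_SHARED, win->fd, aligned);
        if (ptr == MAP_FAILED) { return NULL; }
        win->map       = ptr;
        win->map_len   = len;
        win->map_start = aligned;
    }
    return win->map + (offset - win->map_start);
}
```

On a 64-bit system the same interface could be satisfied by a single call
such as Window_buf(&win, 0, (size_t)win.file_len), after which no request
would trigger a remap.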