Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 52728 invoked from network); 13 Apr 2010 09:27:17 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 13 Apr 2010 09:27:17 -0000 Received: (qmail 97443 invoked by uid 500); 13 Apr 2010 09:27:15 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 97227 invoked by uid 500); 13 Apr 2010 09:27:15 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 97213 invoked by uid 99); 13 Apr 2010 09:27:14 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 13 Apr 2010 09:27:14 +0000 X-ASF-Spam-Status: No, hits=0.7 required=10.0 tests=RCVD_IN_DNSWL_NONE,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (nike.apache.org: local policy) Received: from [209.85.210.179] (HELO mail-yx0-f179.google.com) (209.85.210.179) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 13 Apr 2010 09:27:07 +0000 Received: by yxe9 with SMTP id 9so3414349yxe.29 for ; Tue, 13 Apr 2010 02:26:46 -0700 (PDT) MIME-Version: 1.0 Received: by 10.150.137.5 with HTTP; Tue, 13 Apr 2010 02:26:46 -0700 (PDT) In-Reply-To: References: Date: Tue, 13 Apr 2010 05:26:46 -0400 Received: by 10.151.92.15 with SMTP id u15mr5338541ybl.11.1271150806498; Tue, 13 Apr 2010 02:26:46 -0700 (PDT) Message-ID: Subject: Re: Understanding lucene indexes and disk I/O From: Michael McCandless To: java-user@lucene.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org Hi Tom, Fear not: we only scan up to 128 terms, to find the specific term. First, the terms dict index (tii) is fully loaded into RAM, and then a binary search is done on this (in-RAM) to find the nearest index term just before the term you want. Then, we seek to that spot in the main terms dict (tis) file, and scan (at most 128 entries) to find the term. On the frq/prx deltas: the tii holds absolute pointers. So, on seeking to that first spot in the tis, we know the absolute frq/prx (long) offsets, and then during scanning we just add the deltas we see to those base absolutes. BTW, this approach sucks up alot of extra RAM (holding 2 longs =3D 16 bytes, per indexed term)... in flex (now committed to trunk) we've improved this so that the tii entry does not hold these longs, and, instead, every indexed term inside the tis file holds the absolute pointer not the delta. We also switched to parallel arrays, to avoid the GC/object RAM overhead of one object per indexed term, and used packed ints for some of these arrays. Hmm though we also hold the terms characters as utf8 bytes, which may net/net take more RAM for your case... I would love to get ahold of your terms dict :) I'd have a field day testing Lucene against it... I'm very curious how the flex improvements affect your usage. Mike On Mon, Apr 12, 2010 at 5:57 PM, Burton-West, Tom wrot= e: > Hi all, > > Please let me know if this should be posted instead to the Lucene java-de= v list. > > We have very large tis files (about 36 GB). =A0I have not been too concer= ned as I assumed that due to the indexing of the tis file by the tii file, = only a small portion of the file needed to be read. =A0However, upon re-rea= ding the Lucene Index File Formats document: http://lucene.apache.org/java/= 3_0_1/fileformats.html#Term%20Dictionary > I am now wondering whether most of the tis file may need to be read if a = term is near the end of the file. > > I'm trying to understand whether lucene has enough information stored in = the *tii and *tis files to determine what byte offset in the prx file to se= ek to in order to get the freq or positions list for a particular term and = whether a sequential read of the entire *tis file up to the term being soug= ht is needed in order to decode ProxDeltas and determint the byte offset. > > =A0What has me confused is that the Lucene Index File Formats document sa= ys: > > "ProxDelta determines the position of this term's TermPositions within th= e .prx file. In particular, it is the difference between the position of th= is term's data in that file and the position of the previous term's data (o= r zero, for the first term in the file. For fields with omitTf true, this w= ill be 0 since prox information is not stored." > > I assumed that Lucene implements a binary search of the tii file, then re= ads the appropriate 128 (IndexDivisor) entries from the tis file and does a= binary search on that. =A0 (So that should be one disk seek when the searc= her starts up to read in the entire tii file and then a second disk seek to= load in the appropriate data for 128 terms from the tis file. =A0However, = once a term's entry in the tis file is found, if only the difference betwee= n this term and the previous term's position in the prx file is stored, it = seems that in order to get the actual byte offset of a term in the prx file= , all of the previous term's ProxDelta's =A0starting at the first term in t= he file, would have to be read and added together. =A0If that is true then = we are talking a sequential read of the entire tis file up to the current t= erm. > > Is this correct? =A0 Can someone point me to the area of the code base wh= ere this is implemented ? =A0 Am I missing something here? > > > Tom Burton-West > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org