lucene-java-user mailing list archives

From "Burton-West, Tom" <>
Subject RE: Understanding lucene indexes and disk I/O
Date Tue, 13 Apr 2010 15:55:42 GMT
Thanks Mike,

At some point maybe the File Formats document could be updated to make it clear that the tii
has an entry similar to every IndexInterval'th tis entry, but instead of holding frq/prx deltas
it holds absolute pointers.  Is it worth entering a JIRA issue?  I would be happy to update
the doc myself, but I don't think I have enough of an in-depth understanding.

As you probably have guessed, I'm trying to understand the impact of the over 2.4 billion
unique terms in our indexes on performance (
We suspect that a very large percentage of these terms are due to dirty OCR, but have not
yet found a good way to eliminate a significant amount of dirty OCR.

I assume that these cause a few extra steps in the binary search of the tii file in memory
but we probably won't notice that performance impact since our bottleneck is disk I/O for
reading long postings lists for frequently occurring terms.
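
As a rough back-of-the-envelope check on "a few extra steps", the sketch below estimates the
number of binary search comparisons against the in-memory tii. It assumes, purely for
illustration, that all terms sit in a single segment and that the index interval is 128; the
function name and the comparison figure of 240 million terms are hypothetical, and a real
index spread over several segments or shards would have smaller per-segment numbers.

```python
import math

def binary_search_steps(unique_terms, index_interval=128):
    """Estimate tii size and worst-case binary search steps.

    Assumes one tii entry per `index_interval` terms and a single segment
    (both are simplifying assumptions, not measured values).
    """
    index_entries = math.ceil(unique_terms / index_interval)
    steps = math.ceil(math.log2(index_entries))
    return index_entries, steps

# 2.4 billion unique terms, as mentioned above
print(binary_search_steps(2_400_000_000))  # (18750000, 25)

# Hypothetical comparison: one tenth as many terms
print(binary_search_steps(240_000_000))    # (1875000, 21)
```

Under these assumptions, ten times as many terms adds only about four comparisons per lookup,
which is consistent with the expectation that the cost is small next to disk I/O.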

Am I correct in assuming that even if we have a very large number of garbage terms in our
prx file, the overall size of the file does not significantly affect the number of disk seeks
or amount of data to be read since Lucene can seek to the beginning of the postings for any
particular term?

>> I would love to get ahold of your terms dict :)  I'd have a field day
>>testing Lucene against it... I'm very curious how the flex improvements affect your

Sometime in the next month or so we will get our new test server and after I get the backup
of testing jobs under control, I'd love to do some testing with flex and our data.  


-----Original Message-----
From: Michael McCandless [] 
Sent: Tuesday, April 13, 2010 5:27 AM
Subject: Re: Understanding lucene indexes and disk I/O

Hi Tom,

Fear not: we only scan up to 128 terms, to find the specific term.

First, the terms dict index (tii) is fully loaded into RAM, and then a
binary search is done on this (in-RAM) to find the nearest index term
just before the term you want.  Then, we seek to that spot in the
main terms dict (tis) file, and scan (at most 128 entries) to find the

On the frq/prx deltas: the tii holds absolute pointers.  So, on
seeking to that first spot in the tis, we know the absolute frq/prx
(long) offsets, and then during scanning we just add the deltas we
see to those base absolutes.
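
The lookup described above can be sketched in a few lines of Python. This is a toy in-memory
model, not Lucene's actual file format or API: the names `build_dictionary` and `lookup`, the
tuple layouts, and the use of plain lists in place of the tis/tii files are all assumptions
made for illustration.

```python
from bisect import bisect_right

INDEX_INTERVAL = 128  # assumed interval, matching the "at most 128" above

def build_dictionary(postings, interval=INDEX_INTERVAL):
    """Build toy tis/tii structures.

    postings: sorted list of (term, frq_length, prx_length).
    tis holds (term, frq_delta, prx_delta) for every term.
    tii holds (term, tis_position, absolute_frq, absolute_prx)
    for every interval'th term.
    """
    tis, tii = [], []
    frq = prx = 0            # absolute offsets of the current term
    prev_frq = prev_prx = 0  # absolute offsets of the previous term
    for i, (term, frq_len, prx_len) in enumerate(postings):
        if i % interval == 0:
            tii.append((term, i, frq, prx))
        tis.append((term, frq - prev_frq, prx - prev_prx))
        prev_frq, prev_prx = frq, prx
        frq += frq_len
        prx += prx_len
    return tis, tii

def lookup(term, tis, tii, interval=INDEX_INTERVAL):
    """Return (frq_offset, prx_offset, entries_scanned) or None."""
    # Step 1: binary search the in-RAM index for the nearest
    # index term at or before the requested term.
    keys = [entry[0] for entry in tii]
    slot = bisect_right(keys, term) - 1
    if slot < 0:
        return None
    _, pos, frq, prx = tii[slot]  # absolute pointers from the index

    # Step 2: "seek" to that spot in the main dictionary and scan
    # at most `interval` entries, adding deltas to the base absolutes.
    end = min(pos + interval, len(tis))
    for i in range(pos, end):
        t, frq_delta, prx_delta = tis[i]
        if i > pos:
            frq += frq_delta
            prx += prx_delta
        if t == term:
            return frq, prx, i - pos + 1
        if t > term:
            return None  # passed the spot where the term would be
    return None
```

In this model a lookup never scans more than `interval` entries, however many terms the
dictionary holds; only the binary search over the index grows, and only logarithmically.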

