lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Understanding lucene indexes and disk I/O
Date Tue, 13 Apr 2010 16:32:11 GMT
On Tue, Apr 13, 2010 at 11:55 AM, Burton-West, Tom <tburtonw@umich.edu> wrote:

> At some point maybe the File Formats Document could be updated to make it clear that
the tii has an entry similar to the IntexInterval'th tis entry but instead of holding frq/prx
deltas it holds absolute pointers.  Is it worth entering a JIRA issue?  I would be happy to
update the doc myself, but I'm don't think  I have enough of an in depth understanding.

Well I need to redo this part of file format docs, due to the flex
changes (LUCENE-2371)... I'll add a comment to remember to
state that this means we scan at most 128 terms.

> As you probably have guessed, I'm trying to understand the impact of the over 2.4 billion
unique terms in our indexes on performance (https://issues.apache.org/jira/browse/LUCENE-2257).
 We suspect that a very large percentage of these terms are due to dirty OCR, but have not
yet found a good way to eliminate a significant amount of dirty OCR.

Fun problem ;)

> I assume that these cause a few extra steps in the binary search of the tii file in memory
but we probably won't notice that performance impact since our bottleneck is disk I/O for
reading long postings lists for frequently occurring terms.

Yeah, and added RAM usage holding the tii in memory.

In theory, one could make a flex codec that instead of indexing every
Nth (128 by default) term, indexes in a non-regular manner, eg at any
term > X in frequency (though nobody has built such a codec!).  In
theory this'd be good for cases like yours... because you don't need
every 128 when many of them are very low count.  Ie the added time to
scan to a rare term is usually a non-issue since iterating that rare
term's docs will be so fast.  But added cost when scanning to a common
term "hurts" because it'll take so long to walk that common term's
docs.

> Am I correct in assuming that even if we have a very large number of garbage terms in
our prx file, the overall size of the file does not significantly affect the number of disk
seeks or amount of data to be read since Lucene can seek to the beginning of the postings
for any particular term?

Right, frq and prx.

>>> I would love to get ahold of your terms dict :)  I'd have a field day
>>>testing Lucene against it... I'm very curious how the flex improvements affect
your usage.
>
> Sometime in the next month or so we will get our new test server and after I get the
backup of testing jobs under control, I'd love to do some testing with flex and our data.

I'd love to see the results of this :) EG basic stats like
before/after size of the tis/tii files, RAM used on opening the
reader, startup up time, search performance...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message