lucene-dev mailing list archives

From Doug Cutting <cutt...@lucene.com>
Subject Re: lucene benchmark and profiling
Date Fri, 16 May 2003 21:19:10 GMT
info axinews wrote:
> - For a small index (just 1 MB), a search for a word that is not indexed
> (and therefore returns no results) invokes the readByte() method 2493 times.
> - For a big index (more than 300 MB), a search for a word that is not indexed
> invokes the readByte() method 169451 times (??)

Is your index optimized?  If it is not, then this is not surprising. 
Lucene must look up the term in each index segment.  Looking up a term 
in a segment requires that nearly 1k bytes are read.  A bigger 
unoptimized index has more segments, so this would require more i/o.  If 
you have not already, try optimizing your indexes before benchmarking. 
Optimizing merges all segments into a single segment.
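
To illustrate why the segment count matters, here is a minimal sketch of a toy index in which each segment is a plain dictionary from term to document ids. The names `lookup` and `optimize` and the sample data are assumptions made for this example; they are not Lucene's API and do not model its file formats.

```python
from collections import defaultdict


def lookup(segments, term):
    """Return (doc_ids, segments_probed) for a term across all segments."""
    docs = []
    probed = 0
    for segment in segments:
        probed += 1  # every segment's term dictionary is consulted, hit or miss
        docs.extend(segment.get(term, []))
    return sorted(docs), probed


def optimize(segments):
    """Merge all segments into a single segment, combining each term's postings."""
    merged = defaultdict(list)
    for segment in segments:
        for term, doc_ids in segment.items():
            merged[term].extend(doc_ids)
    return [{term: sorted(ids) for term, ids in merged.items()}]


# Hypothetical unoptimized index made of three small segments.
segments = [
    {"apache": [1], "lucene": [1, 2]},
    {"index": [3], "lucene": [4]},
    {"search": [5]},
]
```

In this toy model, a search for a term that is not indexed still probes all three segments, while the same search against `optimize(segments)` probes only one.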

> But I don't understand why there are so many invocations of this method for a
> search for a word that is not in the index.
> As I understand it, an "inverted index" stores each indexed word in a file
> with the address of a list of all the documents that contain this word.
> When a word appears rarely (or not at all), it seems logical to me that
> the search should be very fast, because the list of documents containing the
> term is very short (or shows that no document contains the term, so the
> search could return immediately).

An unoptimized index consists of a set of inverted indexes, each called 
a segment.  Each segment has a term dictionary, contained in the .tii 
and .tis files.  The .tii is read entirely into memory and tells where 
to seek in the .tis file.  An average of 64 terms in the .tis file must 
then be scanned to find the requested term.  If the average term entry 
is around 10 bytes long, then this would result in 640 bytes read per 
query term per segment, regardless of whether the term exists in the 
index.  If it does exist, then the .frq and (in the case of a phrase 
query) the .prx file must also be read.
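
The two-level lookup described above can be sketched as follows. This is a simplified model, not Lucene's implementation: the sparse list stands in for the .tii file, the full sorted list stands in for the .tis file, and the interval of 128 is an assumption chosen so that a scan covers about 64 terms on average. The function names are hypothetical.

```python
import bisect

INDEX_INTERVAL = 128  # assumed: every 128th term is kept in the in-memory index


def build_sparse_index(sorted_terms, interval=INDEX_INTERVAL):
    """Model of the .tii file: every interval-th term with its position."""
    return [(sorted_terms[i], i) for i in range(0, len(sorted_terms), interval)]


def find_term(sorted_terms, sparse_index, term):
    """Return (found, terms_scanned) for one lookup in the full term list."""
    keys = [t for t, _ in sparse_index]
    slot = bisect.bisect_right(keys, term) - 1  # last indexed term <= term
    if slot < 0:
        return False, 0  # term sorts before the first dictionary entry
    start = sparse_index[slot][1]
    if slot + 1 < len(sparse_index):
        end = sparse_index[slot + 1][1]
    else:
        end = len(sorted_terms)
    scanned = 0
    for candidate in sorted_terms[start:end]:
        scanned += 1
        if candidate == term:
            return True, scanned
        if candidate > term:
            break  # passed the place where the term would be: it is absent
    return False, scanned


# Hypothetical dictionary of 1000 sorted terms.
terms = [f"term{i:05d}" for i in range(1000)]
sparse = build_sparse_index(terms)
```

Note that in this model a missing term costs about as much as a present one: the scan has to reach the position where the term would have been before it can report a miss.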

> My hypothesis is that for each search, all of the terms (those in the .tii
> file) are re-analysed and, even though the .tii file is in memory, analysing
> each term slows down the search. Is that right?

Yes, the .tii file for each segment is read into memory when the 
IndexReader is created.  So, as long as you use the same IndexReader, 
this is not read again per query.
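
The reuse pattern can be sketched like this. `ToyReader` is a hypothetical stand-in written for this example, not Lucene's IndexReader; it only shows the in-memory term indexes being loaded once at construction and then shared by every query.

```python
class ToyReader:
    """Hypothetical stand-in for an index reader (not Lucene's IndexReader API)."""

    def __init__(self, segment_term_indexes, load_log):
        # Simulates reading every segment's .tii into memory once, at creation.
        load_log.append(len(segment_term_indexes))
        self._term_indexes = [sorted(terms) for terms in segment_term_indexes]

    def has_indexed_term(self, term):
        # Queries touch only the in-memory copies; nothing is reloaded here.
        return any(term in terms for terms in self._term_indexes)


load_log = []
reader = ToyReader([["apache", "lucene"], ["index", "search"]], load_log)
results = [reader.has_indexed_term(t) for t in ("lucene", "missing", "search")]
```

Creating a new reader for every query would repeat the load step each time, which is the cost that keeping one reader open avoids.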

Doug

