lucene-dev mailing list archives

From "info axinews" <i...@axinews.com>
Subject lucene benchmark and profiling
Date Fri, 16 May 2003 15:00:26 GMT
Hi all,


I use Lucene to index a large collection. After spending a lot of time
indexing my messages (2.5 million messages of ~3000 bytes each; the index is
4 gigabytes), I have a question: my searches get slower and slower as the
index grows.

For example:
I use Red Hat Linux, 500 MB of RAM, an IDE hard disk and Lucene 1.3.
- When my index is 300 megabytes, a search takes 40 milliseconds.
- With the 4 gigabyte index, a search takes 2000 milliseconds (just the
search).
I suppose that if my index keeps growing (which will be the case in the
future), searches will become slower still.

Some more details:
- for the first search on a word: 700 milliseconds for the search alone and
400 milliseconds to retrieve the summary field from disk for the first 20
hits
- for the second search on the same word: 600 milliseconds for the search
alone and only 4 milliseconds to retrieve the same summary field (of course,
because it is now in the disk cache)

So disk I/O is not really the problem I want to look at (and I have no
control over it anyway). But why is the search still so slow the second
time? It cannot be a disk I/O problem, because the index should already be
in the cache... So I continued my investigation:

I studied the Lucene source and ran a benchmark with a code profiler.
The problem then appears quite clearly: the "InputStream" class consumes
the majority of the resources (particularly the readByte() method, which
operates one byte at a time).
I counted the invocations of readByte() by modifying the method and
recompiling the Lucene sources:
- for a small index (just 1 MB), a search for a word that is not indexed
(and therefore returns no result) invokes readByte() 2493 times.
- for a big index (more than 300 MB), a search for a word that is not
indexed invokes readByte() 169451 times (??)
That is why searches get slower as the index gets larger...
I suppose that if the index grows further, searches will become slower
still, because readByte() will be invoked even more often.
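
The counting experiment above can be reproduced in miniature. The sketch
below is not Lucene's InputStream: CountingByteInput, encodeVInts and the
sample values are hypothetical names chosen for illustration. It assumes a
variable-length integer encoding with 7 payload bits per byte, decoded one
readByte() call at a time, which is one way a byte-oriented reader can end
up with a very high call count.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical stand-in for an index input that counts readByte() calls. */
public class CountingByteInput {
    private final byte[] data;
    private int pos = 0;
    private long readByteCalls = 0;

    public CountingByteInput(byte[] data) {
        this.data = data;
    }

    /** Reads one byte and records the call, like the instrumented method. */
    public byte readByte() {
        readByteCalls++;
        return data[pos++];
    }

    /** Assumed encoding: 7 payload bits per byte, high bit means "more follows". */
    public int readVInt() {
        byte b = readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public long getReadByteCalls() {
        return readByteCalls;
    }

    /** Test helper: encodes non-negative ints in the format readVInt() expects. */
    public static byte[] encodeVInts(int... values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out.write(v);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        CountingByteInput in = new CountingByteInput(encodeVInts(5, 300, 70000));
        int a = in.readVInt();
        int b = in.readVInt();
        int c = in.readVInt();
        System.out.println(a + " " + b + " " + c
                + " readByte calls=" + in.getReadByteCalls());
    }
}
```

Decoding three integers here already costs six readByte() calls, so the
call count tracks the number of encoded bytes visited rather than the
number of values or terms a search needs.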

But I don't understand why there are so many invocations of this method
for a search on a word that is not in the index...
As I understand it, an "inverted index" stores each indexed word in a file,
together with an address pointing to the list of all the documents that
contain this word.
When a word appears rarely (or not at all), it seems logical to me that the
result should come back very quickly, because the list of documents that
contain the term is very short (or shows that no document contains the
term, so the search should be immediate). But that is not the case. The
analysis shows that even for a word that does not appear, there are a lot
of byte operations (which slow down the search), and I don't know why...
My hypothesis is that for each search all the terms are re-read (the terms
in the .tii file), and although the .tii file is in memory, processing each
term slows down the search... Is that right?
Any other ideas?
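
To make the question concrete, here is a minimal sketch of a sparse term
index, where only every Nth term is held in memory and the rest of the
sorted term dictionary is scanned sequentially. It is not Lucene's code:
SparseTermIndex, INDEX_INTERVAL = 128 and the termsScanned counter are
assumptions made for illustration. Under this assumed layout, a lookup for
a term that is absent still has to scan part of one block before it can
say "not found".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sparse term index: every INDEX_INTERVAL-th term is kept in memory. */
public class SparseTermIndex {
    static final int INDEX_INTERVAL = 128; // assumed interval, for illustration only

    private final String[] allTerms;                    // stands in for the sorted on-disk dictionary
    private final List<String> indexTerms = new ArrayList<>(); // in-memory sample of the terms
    long termsScanned = 0;                              // sequential reads done by lookups

    SparseTermIndex(String[] sortedTerms) {
        this.allTerms = sortedTerms;
        for (int i = 0; i < sortedTerms.length; i += INDEX_INTERVAL) {
            indexTerms.add(sortedTerms[i]);
        }
    }

    boolean contains(String term) {
        // Binary search in memory for the last sampled term <= the query term.
        int lo = 0, hi = indexTerms.size() - 1, block = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexTerms.get(mid).compareTo(term) <= 0) {
                block = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (block < 0) {
            return false; // sorts before the first term in the dictionary
        }
        // Sequential scan of one block; stops at the first term greater than the query.
        int start = block * INDEX_INTERVAL;
        int end = Math.min(start + INDEX_INTERVAL, allTerms.length);
        for (int i = start; i < end; i++) {
            termsScanned++;
            int cmp = allTerms[i].compareTo(term);
            if (cmp == 0) return true;
            if (cmp > 0) return false;
        }
        return false;
    }

    /** Builds n sorted terms: term00000, term00002, term00004, ... */
    static String[] evenTerms(int n) {
        String[] terms = new String[n];
        for (int i = 0; i < n; i++) {
            terms[i] = String.format(Locale.ROOT, "term%05d", 2 * i);
        }
        return terms;
    }

    public static void main(String[] args) {
        SparseTermIndex index = new SparseTermIndex(evenTerms(1000));
        boolean found = index.contains("term00601"); // odd number, so absent
        System.out.println("found=" + found + " termsScanned=" + index.termsScanned);
    }
}
```

In this sketch the cost of a miss is bounded by the interval, not by the
size of the dictionary, so it would not by itself explain a call count that
grows with the index. Whether Lucene 1.3 behaves this way is exactly the
open question.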

Sébastien

(English is not my native language, so please excuse any language
mistakes.)



