lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <markrmil...@gmail.com>
Subject Re: First search is slow after updating index .. subsequent searches very fast
Date Fri, 22 Dec 2006 19:02:28 GMT
I am no expert, but as I gloss over the code this is what I see happens 
for a sort (sometimes the less experienced has to get it wrong before an 
expert will jump in with some good info <G> *hint to experts*):

The field cache caches <document : term> pairs. When you sort on a field 
you don't want to have to extract the content of the field as you are 
doing the sort. So the first time you do a sort, the fieldcache is 
loaded up that stores the term to sort on for each document id.

For numerics this involves loading up an array that is keyed by doc id 
with a value of the numeric (I didn't look at the float handling that I 
think I saw).  i.e. Array(documentid) = termtext
The field cache for an integer sort is 32 bits times the number of docs 
then (termtext would be an int).
For a String sort an array is made of all of the Terms in the index 
(terms in the field being sorted on I believe) and another array is made 
that indexes into that term array. So to see what you are sorting on for 
document 4 you would get the value of the first array at position 3 and 
use the result as the index into the Term array. i.e. Array(documentId) 
= index , Terms(index) = termtext
The size of a String fieldcache is going to be the size of those two 
Arrays : 32 bits X number of docs + size of all of the terms (again I 
think this is just the terms in that field, but I have not seen anyone 
say that before).

If you followed that horrible example of an explanation than you can see 
why it is all done at once when you first ask for a sort. The arrays 
created are stored in a weakhashmap that is keyed on by the IndexReader 
that was used for the search (A Searcher contains an IndexReader). So 
every time you open a new Searcher and do a field sorted search it will 
need to cache the <document : term> pairs (using those arrays I talk 
about above).

The actual sorting appears to happen just like with relevancy score 
sorting....using a priority queue that is loaded as a HitCollector 
visits each document.

There may be a little more to the story -- I think I saw that 
Comparators used for sorting are also cached, but someone else will have 
to correct me...I am out of time.

The way to avoid this warm up time that takes place (due to loading up 
those fieldcaches), is to pre-warm a Searcher. When an update is made to 
the index, instead of just opening a new Searcher, keep using the old 
Searcher to serve search requests, start up a new searcher in a 
different thread and perform a sorted search on it, then replace the 
stale Searcher with the new warmed up Searcher.


- Mark

Bryan Dotzour wrote:
> Thanks for that tidbit Mark.  I was just looking through the LIA book
> and stumbled across this sentence under the "5.1.9 Performance effect of
> sorting" section.  It says: "[When sorting by a String type] each unique
> term is also cached for each document. Only the actual fields used for
> sorting are cached in this manner."
>
> In the case that I originally described, our default sorting mechanism
> is an alphabetical sort on the title of each object returned in the
> search.  So I take this excerpt from the book to mean that the
> FieldCache has to read each title value from each document in order to
> perform the sort.  That pretty much sounds like exactly what you're
> saying Mark.
>
> I guess the only question left in my mind is, does the FieldCache have
> to read every value for every document in the entire index to perform
> the sort, or just the values in the documents returned in the search?
> My guess would be the latter although this one index seems much slower
> than all of the others and the only difference is the sheer number of
> items in the index.
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmiller@gmail.com] 
> Sent: Thursday, December 21, 2006 2:48 PM
> To: java-user@lucene.apache.org
> Subject: Re: First search is slow after updating index .. subsequent
> searches very fast
>
> Since you say you are sorting on a field the bulk of the time will be 
> doing the sort and caching it (FieldCache). Subsequent searches use that
>
> cache to avoid paying the full sort cost again. If you where doing 
> relevancy sorting you would not experience such a big delay.
>
> - Mark
>
> Bryan Dotzour wrote:
>   
>> Otis thanks for your suggestion, it seems to be working pretty well!
>> I'm just curious if you (or anyone else) could describe what is
>>     
> actually
>   
>> happening during that initial query that ends up taking so much time.
>> We have several different indexes for different types of objects and
>> it's only this one index that exhibits this kind of behavior.  Is it
>> something related to the size of the index, or the number of fields,
>>     
> or
>   
>> how fragmented the index is?
>>
>> I'm just trying to get a little better understanding of what is going
>>     
> on
>   
>> under the covers there.  I'll spend some time with the source to see
>>     
> if
>   
>> I can figure it out, but any tips from the experts would be much
>> appreciated. =)
>>
>> Thanks again!
>> Bryan
>>
>> -----Original Message-----
>> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
>> Sent: Wednesday, December 20, 2006 4:28 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: First search is slow after updating index .. subsequent
>> searches very fast
>>
>> To populate FieldCache, the number of matches doesn't matter.  There
>>     
> is
>   
>> no need to be scrimy there - you don't really save anything by running
>>     
> a
>   
>> query that matches only a few docs.  Just run something that looks
>>     
> like
>   
>> a common query.
>>
>> For warming up new indices, one can also use the `dd' trick under
>>     
> UNIX.
>   
>> Otis
>>
>> ----- Original Message ----
>> From: Bryan Dotzour <BDotzour@widen.com>
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, December 20, 2006 5:23:40 PM
>> Subject: RE: First search is slow after updating index .. subsequent
>> searches very fast
>>
>> One question about this, Otis... When "warming up" the new searcher,
>> should the query return a lot of results, or does it matter?  Can I
>>     
> just
>   
>> do like an ID = X query and get one document back?  Is that sufficient
>> or is it better to run a query that will get lots of hits?
>>
>> Thanks again,
>> Bryan
>>
>> -----Original Message-----
>> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
>> Sent: Wednesday, December 20, 2006 3:28 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: First search is slow after updating index .. subsequent
>> searches very fast
>>
>> All sounds good.  Opening a new IndexReader can take a bit of time.
>>     
> If
>   
>> you use sorting of any kind other than default sorting by relevance,
>> this delay on the first search is also probably caused by the lazy
>> FieldCache population.  The cure for that is to open a new
>> IndexReader/Searcher before you close the old one, warm it up with a
>> query + sort, and then switch IndexReader/Searchers, closing the old
>> one.
>>
>> Otis
>>
>> ----- Original Message ----
>> From: Bryan Dotzour <BDotzour@widen.com>
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, December 20, 2006 3:59:19 PM
>> Subject: First search is slow after updating index .. subsequent
>> searches very fast
>>
>> I'm investigating some performance issues with the way we're using
>> Lucene in our web app and am interested if anyone could shed some
>>     
> light
>   
>> on what might be going on.  Hopefully I can provide enough
>>     
> information,
>   
>> please let me know if there's more I can give.
>>
>>  
>>
>> We're using Lucene 2.0.0 and I'm currently working with disk-based
>> indexing (although in production I'll want to be using RAM indexing).
>> In our environment, we build up our Lucene index at application start
>>     
> up
>   
>> time and then we optimize the index.  From then on, updates and
>>     
> deletes
>   
>> to the index occur fairly frequently but we don't optimize until the
>> middle of the night when the impact would be at its minimum.  After a
>> while, what I see is that searches will be very fast (~400 ms) until I
>> make a modification that will require a single document to be
>> re-indexed.  Immediately after that has occurred, the next search will
>> take substantially longer (sometimes up to ~25s).  After that search
>>     
> has
>   
>> run, the next search will be back at the ~400ms time.
>>
>>  
>>
>> Our algorithm for handling the updates is as follows:
>>
>> 1.       open an IndexReader on the directory
>>
>> 2.       delete the document using the reader
>>
>> 3.       close the reader
>>
>> 4.       open an IndexWriter
>>
>> 5.       add the new document using the writer
>>
>> 6.       close the writer
>>
>>  
>>
>> For searches:
>>
>> 1.    We cache off an IndexReader for the index, as well as an
>> IndexSearcher, which uses that reader
>> 2.    When a search is initiated we check to see if the version of the
>> index has changed using getCurrentVersion()
>> 3.    If it has changed, we close our IndexSearcher, close the
>> IndexReader and re-open them both
>>
>>  
>>
>>  
>>
>> Anything sound non-standard in that workflow?  Does anyone have an
>>     
> idea
>   
>> of what might be happening during that slow down?
>>
>>  
>>
>> Thanks for your time,
>>
>> Bryan
>>
>>  
>>
>>  
>>
>>  
>>
>> (For a little more info, here is a very common stack trace snippet
>>     
> that
>   
>> I gather when the "slow search" is running.  It seems much of the time
>> is spent in MultiReader or MultiTermDocs)
>>
>>  
>>
>>     
> org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(Com
>   
>> poundFileReader.java:214)
>>  
>>
>>     
> org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.jav
>   
>> a:64)
>>  
>>
>>     
> org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.j
>   
>> ava:33)
>>       org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:56)
>>       org.apache.lucene.index.TermBuffer.read(TermBuffer.java:62)
>>  
>> org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:117)
>>  
>>
>>     
> org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:148)
>   
>>  
>>
>>     
> org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:15
>   
>> 7)
>>  
>> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:151)
>>  
>> org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:50)
>>  
>> org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:392)
>>       org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:348)
>>       org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)
>>       org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)
>>       org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)
>>  
>>
>>     
> org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:171)
>   
>>  
>>
>>     
> org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:153)
>   
>>  
>>
>>     
> org.apache.lucene.search.FieldCacheImpl.getAuto(FieldCacheImpl.java:349)
>   
>>  
>>
>>     
> org.apache.lucene.search.FieldSortedHitQueue.comparatorAuto(FieldSortedH
>   
>> itQueue.java:346)
>>  
>>
>>     
> org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSo
>   
>> rtedHitQueue.java:189)
>>  
>>
>>     
> org.apache.lucene.search.FieldSortedHitQueue.(FieldSortedHitQueue.java:5
>   
>> 8)
>>  
>>
>>     
> org.apache.lucene.search.TopFieldDocCollector.(TopFieldDocCollector.java
>   
>> :40)
>>  
>> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:108)
>>       org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
>>       org.apache.lucene.search.Hits.(Hits.java:52)
>>       org.apache.lucene.search.Searcher.search(Searcher.java:53)
>>
>>  
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>   
>>     
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>   

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message