lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Wang <john.w...@gmail.com>
Subject Re: URGENT: Help indexing large document set
Date Wed, 24 Nov 2004 00:22:59 GMT
Thanks Chuck! I missed the call: getIndexOffset.
I am profiling it again to pin-point where the performance problem is.

-John

On Tue, 23 Nov 2004 16:13:22 -0800, Chuck Williams <chuck@manawiz.com> wrote:
> Are you sure you have a performance problem with
> TermInfosReader.get(Term)?  It looks to me like it scans sequentially
> only within a small buffer window (of size
> SegmentTermEnum.indexInterval) and that it uses binary search otherwise.
> See TermInfosReader.getIndexOffset(Term).
> 
> Chuck
> 
> 
> 
>  > -----Original Message-----
>  > From: John Wang [mailto:john.wang@gmail.com]
>  > Sent: Tuesday, November 23, 2004 3:38 PM
>  > To: lucene-user@jakarta.apache.org
>  > Subject: URGENT: Help indexing large document set
>  >
>  > Hi:
>  >
>  >    I am trying to index 1M documents, with batches of 500 documents.
>  >
>  >    Each document has an unique text key, which is added as a
>  > Field.KeyWord(name,value).
>  >
>  >    For each batch of 500, I need to make sure I am not adding a
>  > document with a key that is already in the current index.
>  >
>  >   To do this, I am calling IndexSearcher.docFreq for each document
> and
>  > delete the document currently in the index with the same key:
>  >
>  >        while (keyIter.hasNext()) {
>  >             String objectID = (String) keyIter.next();
>  >             term = new Term("key", objectID);
>  >             int count = localSearcher.docFreq(term);
>  >
>  >             if (count != 0) {
>  >                 localReader.delete(term);
>  >             }
>  >           }
>  >
>  > Then I proceed with adding the documents.
>  >
>  > This turns out to be extremely expensive, I looked into the code and
> I
>  > see in
>  > TermInfosReader.get(Term term) it is doing a linear look up for each
>  > term. So as the index grows, the above operation degrades at a
> linear
>  > rate. So for each commit, we are doing a docFreq for 500 documents.
>  >
>  > I also tried to create a BooleanQuery composed of 500 TermQueries
> and
>  > do 1 search for each batch, and the performance didn't get better.
> And
>  > if the batch size increases to say 50,000, creating a BooleanQuery
>  > composed of 50,000 TermQuery instances may introduce huge memory
>  > costs.
>  >
>  > Is there a better way to do this?
>  >
>  > Can TermInfosReader.get(Term term) be optimized to do a binary
> lookup
>  > instead of a linear walk? Of course that depends on whether the
> terms
>  > are stored in sorted order, are they?
>  >
>  > This is very urgent, thanks in advance for all your help.
>  >
>  > -John
>  >
>  >
> ---------------------------------------------------------------------
>  > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>  > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message