lucene-java-user mailing list archives

From John Wang <john.w...@gmail.com>
Subject URGENT: Help indexing large document set
Date Tue, 23 Nov 2004 23:37:44 GMT
Hi:

   I am trying to index 1M documents, with batches of 500 documents.

   Each document has a unique text key, which is added as a
Field.Keyword(name, value).

   For each batch of 500, I need to make sure I am not adding a
document with a key that is already in the current index.

   To do this, I call IndexSearcher.docFreq for each document's key
and, if the count is non-zero, delete the document currently in the
index with the same key:
 
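        while (keyIter.hasNext()) {
            String objectID = (String) keyIter.next();
            Term term = new Term("key", objectID);

            // docFreq is the number of indexed documents containing this key
            int count = localSearcher.docFreq(term);

            if (count != 0) {
                // delete every document that already carries this key
                localReader.delete(term);
            }
        }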

Then I proceed with adding the documents.

This turns out to be extremely expensive. I looked into the code and
saw that TermInfosReader.get(Term term) does a linear lookup for each
term, so as the index grows, the above operation degrades at a linear
rate. And for each commit, we are doing a docFreq for 500 documents.
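To put a rough number on that, here is a back-of-the-envelope sketch.
It assumes, as described above, that each lookup walks every term
already in the index (the worst case), and that each document
contributes exactly one unique key term. The class and method names
are hypothetical, and this is a cost model only, not Lucene code.

```java
class LinearLookupCost {
    // Total term comparisons for the whole indexing run, assuming every
    // key lookup scans all key terms already in the index (worst case of
    // a linear walk) and each indexed document adds one unique key term.
    static long totalComparisons(long totalDocs, long batchSize) {
        long total = 0;
        // 'indexed' is the number of documents in the index before the batch
        for (long indexed = 0; indexed < totalDocs; indexed += batchSize) {
            long lookups = Math.min(batchSize, totalDocs - indexed);
            total += lookups * indexed;
        }
        return total;
    }

    public static void main(String[] args) {
        // 1M documents in batches of 500, as in the setup above
        System.out.println(totalComparisons(1_000_000L, 500L)); // prints 499750000000
    }
}
```

Under these assumptions the run costs about 5 * 10^11 comparisons in
total, and the cost of each batch grows in proportion to the size of
the index at that point.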

I also tried building a BooleanQuery composed of 500 TermQueries and
running one search per batch, but the performance did not improve. And
if the batch size increased to, say, 50,000, creating a BooleanQuery
composed of 50,000 TermQuery instances might introduce huge memory
costs.

Is there a better way to do this?

Can TermInfosReader.get(Term term) be optimized to do a binary lookup
instead of a linear walk? Of course, that depends on whether the terms
are stored in sorted order. Are they?
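
To illustrate the difference being asked about, here is a minimal
stand-alone sketch over a plain sorted String array. It is not the
TermInfosReader implementation; the class, method names, and sample
keys are hypothetical, and the binary variant is only correct under
the assumption that the terms really are in sorted order.

```java
import java.util.Arrays;

class SortedTermLookup {
    // Linear walk: compare against each term in turn until the key is
    // found or a larger term is reached. Cost grows with the term count.
    static int linearFind(String[] sortedTerms, String key) {
        for (int i = 0; i < sortedTerms.length; i++) {
            int cmp = sortedTerms[i].compareTo(key);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                break; // terms are sorted, so the key cannot appear later
            }
        }
        return -1;
    }

    // Binary lookup: halves the search range at each step. Only valid
    // because the array is sorted in String.compareTo order.
    static int binaryFind(String[] sortedTerms, String key) {
        int pos = Arrays.binarySearch(sortedTerms, key);
        return pos >= 0 ? pos : -1;
    }

    public static void main(String[] args) {
        String[] terms = {"doc-0001", "doc-0002", "doc-0007", "doc-0042"};
        System.out.println(linearFind(terms, "doc-0007")); // prints 2
        System.out.println(binaryFind(terms, "doc-0007")); // prints 2
        System.out.println(binaryFind(terms, "doc-0003")); // prints -1
    }
}
```

Both methods return the same positions; the binary version needs on
the order of log2(n) comparisons per key rather than n.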

This is very urgent. Thanks in advance for all your help.

-John


