lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chuck Williams" <ch...@manawiz.com>
Subject RE: URGENT: Help indexing large document set
Date Wed, 24 Nov 2004 00:13:22 GMT
Are you sure you have a performance problem with
TermInfosReader.get(Term)?  It looks to me like it scans sequentially
only within a small buffer window (of size
SegmentTermEnum.indexInterval) and that it uses binary search otherwise.
See TermInfosReader.getIndexOffset(Term).

Chuck

  > -----Original Message-----
  > From: John Wang [mailto:john.wang@gmail.com]
  > Sent: Tuesday, November 23, 2004 3:38 PM
  > To: lucene-user@jakarta.apache.org
  > Subject: URGENT: Help indexing large document set
  > 
  > Hi:
  > 
  >    I am trying to index 1M documents, with batches of 500 documents.
  > 
  >    Each document has an unique text key, which is added as a
  > Field.KeyWord(name,value).
  > 
  >    For each batch of 500, I need to make sure I am not adding a
  > document with a key that is already in the current index.
  > 
  >   To do this, I am calling IndexSearcher.docFreq for each document
and
  > delete the document currently in the index with the same key:
  > 
  >        while (keyIter.hasNext()) {
  >             String objectID = (String) keyIter.next();
  >             term = new Term("key", objectID);
  >             int count = localSearcher.docFreq(term);
  > 
  >             if (count != 0) {
  >                 localReader.delete(term);
  >             }
  >           }
  > 
  > Then I proceed with adding the documents.
  > 
  > This turns out to be extremely expensive, I looked into the code and
I
  > see in
  > TermInfosReader.get(Term term) it is doing a linear look up for each
  > term. So as the index grows, the above operation degrades at a
linear
  > rate. So for each commit, we are doing a docFreq for 500 documents.
  > 
  > I also tried to create a BooleanQuery composed of 500 TermQueries
and
  > do 1 search for each batch, and the performance didn't get better.
And
  > if the batch size increases to say 50,000, creating a BooleanQuery
  > composed of 50,000 TermQuery instances may introduce huge memory
  > costs.
  > 
  > Is there a better way to do this?
  > 
  > Can TermInfosReader.get(Term term) be optimized to do a binary
lookup
  > instead of a linear walk? Of course that depends on whether the
terms
  > are stored in sorted order, are they?
  > 
  > This is very urgent, thanks in advance for all your help.
  > 
  > -John
  > 
  >
---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message