lucene-dev mailing list archives

From "Yonik Seeley" <ysee...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
Date Thu, 13 Jul 2006 21:07:03 GMT
On 7/12/06, Ning Li <ning.li.li@gmail.com> wrote:
> > If it can be done in a separate class, using public APIs (or at least
> > with a minimum of protected access), without a loss in performance,
> > then that's the way to go IMO.
>
> This is exactly what I'm asking. Can it be done using public APIs
> without a loss in performance or functionality?

The answer to that is the crux of the matter :-)

Solr's implementation is here:
http://svn.apache.org/viewvc/incubator/solr/trunk/src/java/org/apache/solr/update/DirectUpdateHandler2.java?view=markup
As I said, instead of keeping track of the maxSegment (equivalent to
the docid of the in-memory segments), Solr keeps track, per id, of the
number of documents not to delete.
So, a delete sets the count to 0, an overwriting add sets the count to
1, and a non-overwriting add increases the count by one.

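To make the counting scheme above concrete, here is a minimal sketch in
plain Java. It is not the code in DirectUpdateHandler2; the class and
method names (PendingDeletes, keepCount, toDelete) are hypothetical, and
it assumes that an id with no pending entry needs no tracking, so a
non-overwriting add only bumps a count that already exists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the "number of documents not to delete"
// bookkeeping described above. Not Solr's actual implementation.
class PendingDeletes {
    // id -> how many of the most recently added docs with that id to keep
    private final Map<String, Integer> keep = new HashMap<>();

    // A delete means: keep none of the docs with this id.
    void delete(String id) { keep.put(id, 0); }

    // An overwriting add means: keep only the newest doc with this id.
    void addOverwriting(String id) { keep.put(id, 1); }

    // A non-overwriting add keeps one more doc. Assumption: if nothing is
    // pending for this id, there is nothing to delete and nothing to track.
    void addNonOverwriting(String id) {
        keep.computeIfPresent(id, (k, n) -> n + 1);
    }

    // Returns null when nothing is pending for the id.
    Integer keepCount(String id) { return keep.get(id); }

    // Given the docs matching an id, oldest first, return the ones to
    // delete at commit time: everything except the last keepCount docs.
    static List<Integer> toDelete(List<Integer> matchingDocs, int keepCount) {
        int cut = Math.max(0, matchingDocs.size() - keepCount);
        return new ArrayList<>(matchingDocs.subList(0, cut));
    }

    public static void main(String[] args) {
        PendingDeletes p = new PendingDeletes();
        p.addOverwriting("doc1");
        p.addNonOverwriting("doc1");
        System.out.println(p.keepCount("doc1")); // prints 2

        List<Integer> docs = List.of(3, 7, 9, 12);
        System.out.println(toDelete(docs, p.keepCount("doc1"))); // prints [3, 7]
    }
}
```

The point of the sketch is that nothing has to be deleted at add time;
the counts are only consulted when the real deletes run.
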
I'll do some speculation on how Solr would compare with NewIndexModifier:
For a very small batch of updates to an existing index:
  - should be little or no difference... they both do the same amount of work.
Building a complete index, without any deletes:
  - no difference
Building a complete index, with deletes:
  - there will be some differences...
    Currently, Solr only does the real deletes on a commit call (but
    this could be changed).  That means that NewIndexModifier will be
    doing deletes more often (every maxBufferedDocs).  The benefit to
    more frequent deletes when doing a complete index build is that
    some of them will be on a smaller index... deletes very early on
    in the process will be faster than those later on when the index
    is larger.  The downside to more frequent deletes is that more
    IndexReaders are opened and closed.
For a large batch of updates (deletes and adds) to an existing index:
  - probably Solr would be faster due to a single delete phase.

For the default Lucene maxBufferedDocs, I would guess Solr's method
would probably be faster in the majority of cases.  As maxBufferedDocs
increases, that advantage would lessen.
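
A rough way to see why the gap narrows as maxBufferedDocs grows is to
count delete phases (each one opening and closing IndexReaders). The
sketch below is only arithmetic under stated assumptions: every flush of
maxBufferedDocs documents triggers one delete phase, a commit-based
scheme does one delete phase per commit, and the document counts are
made-up illustrative values. It says nothing about the per-delete cost
on a smaller versus larger index. The class and method names are
hypothetical.

```java
// Hypothetical back-of-the-envelope count of delete phases.
class DeletePhases {
    // One delete phase per flush: ceil(totalDocs / maxBufferedDocs).
    static long perFlush(long totalDocs, int maxBufferedDocs) {
        return (totalDocs + maxBufferedDocs - 1) / maxBufferedDocs;
    }

    // One delete phase per explicit commit call.
    static long perCommit(int commits) {
        return commits;
    }

    public static void main(String[] args) {
        long totalDocs = 1_000_000L; // illustrative value, not a measurement
        System.out.println(perFlush(totalDocs, 10));   // prints 100000
        System.out.println(perFlush(totalDocs, 1000)); // prints 1000
        System.out.println(perCommit(1));              // prints 1
    }
}
```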

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org