lucene-dev mailing list archives

From Chuck Williams <ch...@manawiz.com>
Subject Re: Java 1.5 (was Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided))
Date Sat, 08 Jul 2006 03:12:04 GMT

DM Smith wrote on 07/07/2006 07:07 PM:
> Otis,
>     First let me say, I don't want to rehash the arguments for or
> against Java 1.5.

This is an emotional issue for people on both sides.

>     However, I think you have identified that the core people need to
> make a decision and the rest of us need to go with it.

It would be most helpful to have clarity on this issue.

> On Jul 7, 2006, at 1:17 PM, Otis Gospodnetic wrote:
>
>> Hi Chuck,
>>
>> I think bulk update would be good (although I'm not sure how it would
>> be different from batching deletes and adds, but I'm sure there is a
>> difference, or else you wouldn't have done it).

Bulk update works by rewriting all segments that contain a document to
be modified in a single linear pass.  This is orders of magnitude faster
than delete/add if the set of documents to be updated is large,
especially if only a few small fields are mutable on Documents that have
many possibly large immutable fields.  E.g., on a somewhat slow
development machine I updated several fields on 1,000,000 large
documents in 43 seconds.
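
To make the single-pass idea concrete, here is a minimal in-memory sketch.
It is not the contributed code and does not touch Lucene's index format:
the class SegmentRewriteSketch, the method setField, and the modelling of a
segment as a list of field maps keyed by an "id" field are all assumptions
made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SegmentRewriteSketch {

    // Hypothetical model: a segment is a list of documents and a document
    // is a map of field name to value. Real segments are on-disk files.
    // Returns the number of segments that had to be rewritten.
    static int setField(List<List<Map<String, String>>> segments,
                        Set<String> ids, String field, String value) {
        int rewritten = 0;
        for (int s = 0; s < segments.size(); s++) {
            List<Map<String, String>> segment = segments.get(s);

            // Does this segment contain any document selected for update?
            boolean affected = false;
            for (Map<String, String> doc : segment) {
                if (ids.contains(doc.get("id"))) {
                    affected = true;
                    break;
                }
            }
            if (!affected) {
                continue; // untouched segments are left exactly as they are
            }

            // One linear pass: copy every document, changing only selected ones.
            List<Map<String, String>> copy = new ArrayList<>(segment.size());
            for (Map<String, String> doc : segment) {
                Map<String, String> out = new HashMap<>(doc);
                if (ids.contains(doc.get("id"))) {
                    out.put(field, value);
                }
                copy.add(out);
            }
            segments.set(s, copy);
            rewritten++;
        }
        return rewritten;
    }
}
```

The cost is proportional to the size of the affected segments rather than
to the number of updated documents, which is roughly where the advantage
over per-document delete/add comes from when the update set is large.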

There is an existing patch in JIRA that takes this same approach
(LUCENE-382).  However, the limitations in that patch are substantial:
only optimized indexes, stored fields are not updated, updates are
independent of the existing field value, etc.  These limitations make
that implementation unsuitable for many use cases.

My implementation eliminates all of those limitations, providing a fast,
flexible solution for applying an arbitrary value transformation to
selected documents and fields in the index (doc.field.new_value = f(doc,
field.old_value, doc.other_field_values) for arbitrary f).  It also
works with ParallelReader (and the ParallelWriter I've already
contributed).  This allows the mutable fields to be segregated into a
separate subindex.  Only that subindex need be updated.  This alone is
an enormous advantage over a large number of delete/adds, where the same
optimization is not possible due to the doc-id synchronization
requirements of ParallelReader.
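
As a rough illustration of the transformation and of why the subindex
arrangement helps, consider the sketch below.  It is only a model under
stated assumptions: ParallelUpdateSketch, FieldTransform, and update are
hypothetical names, the two indexes are plain lists aligned by position
(standing in for shared doc ids), and none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParallelUpdateSketch {

    // Hypothetical callback: new value = f(doc, old value), where doc
    // exposes the document's other field values as well.
    interface FieldTransform {
        String apply(Map<String, String> doc, String oldValue);
    }

    // mainIndex holds the large immutable fields, mutableIndex the small
    // mutable ones. Position i in both lists is the same document.
    // Only a new copy of the mutable subindex is produced.
    static List<Map<String, String>> update(List<Map<String, String>> mainIndex,
                                            List<Map<String, String>> mutableIndex,
                                            String field, FieldTransform f) {
        List<Map<String, String>> rewritten = new ArrayList<>(mutableIndex.size());
        for (int docId = 0; docId < mutableIndex.size(); docId++) {
            // Combined view of the document across both indexes.
            Map<String, String> view = new HashMap<>(mainIndex.get(docId));
            view.putAll(mutableIndex.get(docId));

            Map<String, String> out = new HashMap<>(mutableIndex.get(docId));
            out.put(field, f.apply(view, out.get(field)));
            rewritten.add(out); // same position, so doc ids stay aligned
        }
        return rewritten;
    }
}
```

A transform that returns the old value unchanged leaves a document as it
was, so document selection can live inside f.  Because every document
keeps its position, the main index never has to be touched, whereas a
delete followed by an add would assign a new doc id and force the same
change in every parallel index.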

There is a substantial amount of code required to do this, and it is
completely dependent on the index representation.  To simplify merge
issues with ongoing Lucene changes, I had to copy and edit certain
private methods out of the existing index code (and make extensive use
of the package-only APIs).  Beyond the normal benefits of open sourcing
code, my interest in contributing this is to see the index code
refactored to take bulk update into account.  That interest is
heightened by the current focus on a new flexible index representation.
I would like to see bulk update as one of the operations supported in
the new representation.

>> So I think you should contribute your code.  This will give us a real
>> example of having something possibly valuable, and written with 1.5
>> features, so we can finalize 1.4 vs. 1.5 discussion, probably with a
>> vote on lucene-dev.

I doubt any single contribution will change anyone's mind.  I would like
to have clarity on the 1.5 decision before deciding whether or not to
contribute this and other things.  My ParallelWriter contribution, which
also requires 1.5, is already sitting in JIRA.

I only work in 1.5 and use its features extensively.  I don't think
about 1.4 at all, and so have no idea how heavily dependent the code in
question is on 1.5.

Unfortunately, I won't be able to contribute anything substantial to
Lucene so long as it has a 1.4 requirement.

Chuck



