lucene-dev mailing list archives

From Doug Cutting <>
Subject RE: Format Stripping [ was: XLS parser ]
Date Tue, 22 Jan 2002 17:37:25 GMT
> From: Brian Goetz []
> I like the idea of being able to add fields to a Document after the
> Document is indexed.  Then, for documents with a long 'body' and short
> metadata fields, you could process the body through an InputStream
> adapter, which would, as a side effect, store the other fields
> somewhere, and then add them.  Doug, how hard would this be to support
> adding some new fields to an already indexed document?

Before a document can be added to an index, Lucene must sort all of the
terms in it, and thus it must have all of these terms up front.

This could be changed.  Some background:  When a document is added, it is
written as a segment.  Segments are each complete indexes, containing
documents numbered from zero.  To keep from having to search too many
segments, segments are periodically merged.  When segments are merged,
documents in all but the first are re-numbered.  For example, merging two
segments each containing three documents numbered 0, 1, and 2 creates a new
segment containing documents numbered 0 through 5.  If there are deleted
documents, then more re-numbering happens as deleted documents are dropped.
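
To make the renumbering concrete, here is a minimal sketch in Java.  It is
not Lucene's merge code: SegmentSketch and HardMergeSketch are hypothetical
names, and a "document" is reduced to a string.  It only shows that a
document's number after a merge is its position among the surviving
documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for a segment: documents are numbered 0..size-1
// by their position in the list, and a bit set marks deleted ones.
class SegmentSketch {
    final List<String> docs;
    final BitSet deleted = new BitSet();

    SegmentSketch(List<String> docs, int... deletedDocs) {
        this.docs = docs;
        for (int d : deletedDocs) {
            deleted.set(d);
        }
    }
}

class HardMergeSketch {
    // Concatenate the segments in order, dropping deleted documents.
    // A document's index in the returned list is its new number.
    static List<String> merge(List<SegmentSketch> segments) {
        List<String> merged = new ArrayList<>();
        for (SegmentSketch segment : segments) {
            for (int doc = 0; doc < segment.docs.size(); doc++) {
                if (!segment.deleted.get(doc)) {
                    merged.add(segment.docs.get(doc));
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        SegmentSketch a = new SegmentSketch(Arrays.asList("a0", "a1", "a2"));
        SegmentSketch b = new SegmentSketch(Arrays.asList("b0", "b1", "b2"));
        // Documents 0, 1, 2 of the second segment become 3, 4, 5.
        System.out.println(merge(Arrays.asList(a, b)));
    }
}
```

Passing deleted document numbers to the SegmentSketch constructor shows the
extra renumbering: every surviving document after a deleted one shifts down.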

Segments and index contents are also merged "softly", on the fly, by
SegmentsReader and MultiSearcher, which permit searching of multiple
segments or entire indexes.  These on-the-fly merges also re-number, softly.
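
A soft merge can be pictured as an offset table rather than a rewrite.  The
sketch below is an assumption about the general technique, not the actual
SegmentsReader or MultiSearcher code; SoftMergeSketch and its method names
are hypothetical.

```java
// Hypothetical illustration of on-the-fly renumbering across sub-indexes.
class SoftMergeSketch {
    // starts[i] is the global number of the first document of sub-index i;
    // the final entry holds the total document count.
    private final int[] starts;

    SoftMergeSketch(int... sizes) {
        starts = new int[sizes.length + 1];
        for (int i = 0; i < sizes.length; i++) {
            starts[i + 1] = starts[i] + sizes[i];
        }
    }

    // Translate a sub-index-local document number to a global one.
    int toGlobal(int subIndex, int localDoc) {
        return starts[subIndex] + localDoc;
    }

    // Find which sub-index holds a global document number.
    int subIndexOf(int globalDoc) {
        if (globalDoc < 0 || globalDoc >= maxDoc()) {
            throw new IllegalArgumentException("no such document: " + globalDoc);
        }
        int i = 0;
        while (globalDoc >= starts[i + 1]) {
            i++;
        }
        return i;
    }

    // Translate a global document number back to its local number.
    int toLocal(int globalDoc) {
        return globalDoc - starts[subIndexOf(globalDoc)];
    }

    int maxDoc() {
        return starts[starts.length - 1];
    }

    public static void main(String[] args) {
        SoftMergeSketch merged = new SoftMergeSketch(3, 3);
        // Local document 0 of the second sub-index is seen as global 3.
        System.out.println(merged.toGlobal(1, 0));
    }
}
```

Nothing on disk changes here; the numbers are translated each time a
search result crosses the boundary between a sub-index and the merged view.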

In order to add partial documents we'd need to change things so that
segments can be merged without renumbering.  A document could be assigned a
number when it is created.  A segment could be written containing some of
its terms, and another segment could be written containing more.  (For
merging to be efficient, we'd probably need to require that all segments of
a document were added before another document is added.)  Then merging (hard
or soft) would combine the segments of a document for search.
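
The proposal above might look roughly like the following sketch.  This is
only an illustration of the idea, not an existing Lucene feature: the class
PartialDocMergeSketch, the segment() helper, and the use of in-memory maps
as postings are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model: each partial segment maps a term to the sorted set
// of document numbers containing it.  Document numbers are assigned when
// the document is created and are never changed by the merge.
class PartialDocMergeSketch {
    // Build a partial segment holding some of one document's terms.
    static SortedMap<String, SortedSet<Integer>> segment(int docNumber, String... terms) {
        SortedMap<String, SortedSet<Integer>> postings = new TreeMap<>();
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docNumber);
        }
        return postings;
    }

    // Merge without renumbering: union the postings term by term.
    static SortedMap<String, SortedSet<Integer>> merge(
            List<SortedMap<String, SortedSet<Integer>>> segments) {
        SortedMap<String, SortedSet<Integer>> merged = new TreeMap<>();
        for (SortedMap<String, SortedSet<Integer>> seg : segments) {
            for (Map.Entry<String, SortedSet<Integer>> e : seg.entrySet()) {
                merged.computeIfAbsent(e.getKey(), t -> new TreeSet<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Document 7 arrives in two pieces: body terms, then metadata terms.
        System.out.println(merge(Arrays.asList(
                segment(7, "index", "lucene"),
                segment(7, "cutting"),
                segment(8, "lucene"))));
    }
}
```

After the merge, a search for either a body term or a metadata term finds
document 7 under the same number, which is the property the partial-document
scheme depends on.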

A renumbering merge would still be required to reclaim the numbers of
deleted documents.  Lucene uses arrays indexed by document number for a few
things, so renumbering is needed to keep these arrays from getting too big.
It also helps with index compression.
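
As a small illustration of why compaction keeps document-number-indexed
arrays small, the sketch below computes an old-to-new number map that skips
deleted documents.  DocMapSketch and docMap() are hypothetical names used
only for this example.

```java
import java.util.Arrays;
import java.util.BitSet;

class DocMapSketch {
    // For each old document number, return its new number after deleted
    // documents are dropped, or -1 if the document was deleted.
    static int[] docMap(BitSet deleted, int maxDoc) {
        int[] map = new int[maxDoc];
        int next = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            map[doc] = deleted.get(doc) ? -1 : next++;
        }
        return map;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(1);
        deleted.set(3);
        System.out.println(Arrays.toString(docMap(deleted, 5)));
    }
}
```

Any per-document array then needs one slot per surviving document rather
than one per number ever assigned.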

Someday when I have the time I can look more closely at how hard this would
be to implement.  It would certainly require changes to lots of code!

