lucene-java-user mailing list archives

From Babak Farhang <farh...@gmail.com>
Subject incremental document field update
Date Thu, 14 Jan 2010 09:23:39 GMT
Hi,

I've been thinking about how to update a single field of a document
without touching its other fields. This is an old problem and I was
considering a solution along the lines of Andrzej Bialecki's post to
the dev list back in '07:


<quote  http://markmail.org/message/tbkgmnilhvrt6bii >

I have the following scenario: I want to use ParallelReader to
maintain parts of the index that are changing quickly, and where
changes are limited to specific fields only.

Let's say I have a "main" index (many fields, slowly changing, large
updates), and an "aux" index (fast changing, usually single doc and
single field updates). I'd like to "replace" documents in the "aux"
index - that is, delete one doc and add another - but in a way that
doesn't change the internal document numbers, so that I can keep the
mapping required by ParallelReader intact.

I think this is possible to achieve by using a FilterIndexReader,
which keeps a map of updated documents, and re-maps old doc ids to the
new ones on the fly.

From time to time I'd like to optimize the "aux" index to get rid of
deleted docs. At this time I need to figure out how to preserve the
old->new mapping during the optimization.

So, here's the question: is this scenario feasible? If so, then in the
trunk/ version of Lucene, is there any way to figure out (predictably)
how internal document numbers are reassigned after calling optimize()
?

</quote>


Reading that trail, I see that the original poster gave up on his idea (
http://markmail.org/message/tbkgmnilhvrt6bii#query:+page:1+mid:kn77zpiu43kd2ufn+state:results
):


<quote>
Thanks for the input - for now I gave up on this, after discovering
that I would have no way to ensure in TermDocs.skipTo() that document
id-s are monotonically increasing (which seems to be a part of the
contract).
</quote>

I imagine if Andrzej's proposed FilterIndexReader maintains 2 sorted
(ordered) maps, one from internal document-ids to "view" document-ids,
and another mapping from  "view" document-ids to internal
document-ids, then things like skipTo() can be implemented reasonably
efficiently. Only the mapped ids are maintained in these structures.
(Also note that a mapped "view" document-id represents an internally
deleted document with that id.)
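
To make the two-map idea concrete, here is a rough sketch in plain Java.
Everything in it is hypothetical and not part of Lucene: the class name
DocIdRemap, its methods, and the -1 "no more docs" return value are made
up for illustration. A term's postings are modeled as a sorted set of
internal doc-ids rather than a real TermDocs, and I assume that set
contains only live (non-deleted) documents.

```java
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;

/** Hypothetical helper (not a Lucene class): keeps only the remapped ids. */
final class DocIdRemap {
    // internal doc-id -> "view" doc-id, for remapped documents only
    private final TreeMap<Integer, Integer> internalToView = new TreeMap<>();
    // "view" doc-id -> internal doc-id, for remapped documents only
    private final TreeMap<Integer, Integer> viewToInternal = new TreeMap<>();

    /** Record that the document shown at viewId now lives at newInternalId. */
    void remap(int viewId, int newInternalId) {
        Integer previous = viewToInternal.put(viewId, newInternalId);
        if (previous != null) {
            internalToView.remove(previous); // earlier replacement is now deleted
        }
        internalToView.put(newInternalId, viewId);
    }

    int toView(int internalId) {
        return internalToView.getOrDefault(internalId, internalId);
    }

    int toInternal(int viewId) {
        return viewToInternal.getOrDefault(viewId, viewId);
    }

    /**
     * Smallest view doc-id >= target whose backing internal document is in
     * postings, or -1 if there is none. Assumes postings holds live docs only.
     */
    int skipTo(NavigableSet<Integer> postings, int target) {
        // Candidate 1: first unmapped posting >= target (view id == internal id).
        Integer direct = postings.ceiling(target);
        while (direct != null && internalToView.containsKey(direct)) {
            direct = postings.higher(direct);
        }
        // Candidate 2: first remapped view id >= target whose internal doc matches.
        Integer viaMap = null;
        for (Map.Entry<Integer, Integer> e : viewToInternal.tailMap(target, true).entrySet()) {
            if (direct != null && e.getKey() > direct) {
                break; // cannot beat the unmapped candidate any more
            }
            if (postings.contains(e.getValue())) {
                viaMap = e.getKey();
                break;
            }
        }
        if (direct == null) {
            return viaMap == null ? -1 : viaMap;
        }
        return viaMap == null ? direct : Math.min(direct, viaMap);
    }
}
```

With postings {1, 4, 7, 9} and view id 2 remapped to internal id 7,
repeated skipTo() calls in this sketch yield the view ids 1, 2, 4, 9, so
the ids seen by the caller stay monotonically increasing. The cost of
each call depends on how many remapped ids have to be walked, which is
why keeping the maps small matters.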

And if we can find a way to merge the segments of this "aux" index
whenever the segments of its associated "main" index are merged
or optimized (such that the [internal] doc-ids in the merged aux index
end up in sync with those of the main index), then there shouldn't
be all that many doc-ids to map anyway (if we merge frequently
enough).

So, to go back to Andrzej's question: is there any way to figure out
(predictably) how internal document numbers [in the main index] are
reassigned after calling optimize()? And how does LUCENE-847, as Doug
Cutting suggests in that trail, help?

Sorry if that was long-winded; I had to start somewhere ;)

-Babak
