lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Copy and augment an indexed Document
Date Thu, 31 Dec 2009 01:19:44 GMT
It is possible to reconstruct a document from the terms, but
it's a lossy process. Luke does this (you can see from the
UI, and the code is available). There's no utility that I know
of to make this easy.

It's an open question whether this is more or less work than
re-parsing the document (I infer that you have the originals
available). Before trying to reconstruct the document I'd
ask how often you need to do this. The gremlins coming out
of the woodwork from reconstruction would consume a lot
of resources. For instance, could you guarantee term positions
when you re-indexed the reconstructed document?
What about stemmed words? How about synonyms?
And are you willing to spend the time tracking down the
bugs?

As I said, I wouldn't go there until forced....

FWIW
Erick

On Wed, Dec 30, 2009 at 5:08 PM, tsuraan <tsuraan@gmail.com> wrote:

> Suppose I have a (useful) document stored in a Lucene index, and I
> have a variant that I'd also like to be able to search.  This variant
> has the exact same data as the original document, but with some extra
> fields.  I'd like to be able to use an IndexReader to get the document
> that I stored, use the document's add method to put my extra fields
> in, and then add that document to the index using an IndexWriter.
> This doesn't seem to work, in general.  Non-stored fields of the
> original document are not in the variant document.  This makes sense
> from an OO point of view (how would that document object possibly have
> the non-stored data of the original doc), but is there some
> lower-level way to do what I want to do?
>
> It's somewhat expensive to completely re-create my document, as it
> relies on data parsed from (often large) pdf and MS Office files.  I'd
> like to be able to use the already-stored terms that are in my index
> and associated with my existing document.  Can I iterate through the
> terms of my index and add references to my newly-added document?  Is
> there any utility to make this work nicely?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message