lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Storing extra data in index
Date Tue, 27 Feb 2007 17:22:01 GMT
You can just add a document. I used this technique in an application,
and it hinges upon realizing that not all documents in an index need
to have the same fields. So, say your regular documents have
fields f1, f2, f3...fn. Create a special document with fields
s1, s2, s3, s4 that contain your meta data. Whenever you add
more "regular" documents to the index you can modify
the special document as necessary.

The beauty of this is that as long as the special document
contains no fields in common with your regular documents,
you'll never have it returned by searches because the fields
are disjoint. And searches to find it will be very fast because
there's only one.

You can take this as far as you like. For instance, you
could store a field (no need to even index it!) that
contains, say, an XML version of all the meta-data
you want to use in your special document. Perhaps
you want to read this document in at startup and
store it in a convenient form. Or.....

If you go this route, you may want to consider creating and storing
the meta-document as a post-build step. I was surprised at how
quickly I could traverse an index and build up the meta-data
document after I'd finished with all of the "regular" processing.

One caution, however; I'd be very careful about storing Lucene
document Ids in my meta-data document since they may change
if you delete documents and then optimize your index. In fact, they
WILL change.

BTW, I thoroughly approve of keeping all the parts you can
in the index, since that's fewer things to keep track of.

Hope this helps
Erick

On 2/27/07, Mike O'Leary <tm-oleary@comcast.net> wrote:
>
> Is there a standard programming idiom for adding extra data to an index
> that
> has been created? I am trying to write code to index and search a set of
> documents using the BM25 algorithm, so (as I understand it) I need to
> store
> the length of each document somewhere and the average document length for
> the collection somewhere (and, I guess, the number of documents that have
> been indexed at any point so I can keep a running average). It seems like
> it
> would make sense to store these values in the index somehow so they are
> available to the search code. Is there sample code somewhere that
> describes
> how to do something like this? Or is there a better way that I'm not
> thinking of? Thanks.
>
> Mike O'Leary
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message