lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mike O'Leary" <tm-ole...@comcast.net>
Subject RE: Storing extra data in index
Date Tue, 27 Feb 2007 17:30:33 GMT
So if I wanted to record the length of each individual document, would it be
better to store that information with each document, perhaps as an unindexed
field? Or are there ways to refer to the indexed documents that don't change
through delete and optimize steps? Thanks.

Mike O'Leary

  _____  

From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Tuesday, February 27, 2007 9:22 AM
To: java-user@lucene.apache.org; tm-oleary@comcast.net
Subject: Re: Storing extra data in index

 

You can just add a document. I used this technique in an application, 
and it hinges upon realizing that not all documents in an index need
to have the same fields. So, say your regular documents have
fields f1, f2, f3...fn. Create a special document with fields 
s1, s2, s3, s4 that contain your meta data. Whenever you add
more "regular" documents to the index you can modify
the special document as necessary.

The beauty of this is that as long as the special document 
contains no fields in common with your regular documents,
you'll never have it returned by searches because the fields
are disjoint. And searches to find it will be very fast because
there's only one. 

You can take this as far as you like. For instance, you
could store a field (no need to even index it!) that 
contains, say, an XML version of all the meta-data
you want to use in your special document. Perhaps 
you want to read this document in at startup and
store it in a convenient form. Or.....

If you go this route, you may want to consider creating and storing
the meta-document as a post-build step. I was surprised at how 
quickly I could traverse an index and build up the meta-data
document after I'd finished with all of the "regular" processing.

One caution, however; I'd be very careful about storing Lucene 
document Ids in my meta-data document since they may change
if you delete documents and then optimize your index. In fact, they
WILL change.

BTW, I thoroughly approve of keeping all the parts you can
in the index, since that's fewer things to keep track of. 

Hope this helps
Erick

On 2/27/07, Mike O'Leary <tm-oleary@comcast.net> wrote:

Is there a standard programming idiom for adding extra data to an index that
has been created? I am trying to write code to index and search a set of
documents using the BM25 algorithm, so (as I understand it) I need to store 
the length of each document somewhere and the average document length for
the collection somewhere (and, I guess, the number of documents that have
been indexed at any point so I can keep a running average). It seems like it

would make sense to store these values in the index somehow so they are
available to the search code. Is there sample code somewhere that describes
how to do something like this? Or is there a better way that I'm not 
thinking of? Thanks.

Mike O'Leary

 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message