lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Storing extra data in index
Date Tue, 27 Feb 2007 17:50:25 GMT
Keep in mind that you'll have to store the length as you index. If you
tried to store the length with each document as a post-step, you'd
have to delete and re-add every document to the index...

That said, it's really up to you. It's very quick to use TermEnum/
TermDocs to enumerate all the lengths. Even though this works
with Lucene doc IDs, that's OK because an open reader sees a
snapshot of the index: the IDs won't change until you
close your reader (and presumably re-read the data).

Or, you can simply create some sort of unique ID for each
doc that's entirely independent of the Lucene ID and store
*that* id along with the length in your meta-data. Use whichever
you think would suit your needs better.
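
A minimal sketch of that second option, in plain Java rather than Lucene
API calls. The class and method names (DocLengthTable, put, lengthOf) are
hypothetical and only illustrate the idea: lengths are keyed by an ID the
application assigns, so nothing here depends on Lucene's internal document
numbers. How the table is persisted (for example, serialized into the
meta-data document) is left open.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: document lengths keyed by an
// application-assigned ID instead of Lucene's internal document number.
final class DocLengthTable {
    private final Map<String, Integer> lengthById = new LinkedHashMap<>();

    /** Records (or overwrites) the length, in tokens, of the given document. */
    void put(String externalId, int lengthInTokens) {
        if (lengthInTokens < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        lengthById.put(externalId, lengthInTokens);
    }

    /** Forgets a document, e.g. after it has been deleted from the index. */
    void remove(String externalId) {
        lengthById.remove(externalId);
    }

    /** Returns the stored length, or -1 if the ID is unknown. */
    int lengthOf(String externalId) {
        Integer length = lengthById.get(externalId);
        return length == null ? -1 : length;
    }

    /** Number of documents currently tracked. */
    int size() {
        return lengthById.size();
    }
}
```

Because the keys are your own IDs, removing one entry leaves every other
entry valid, whatever the index does to its document numbers.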

You'll only discover which is best by testing in your situation.
I suspect either will be "good enough".

Erick

On 2/27/07, Mike O'Leary <tm-oleary@comcast.net> wrote:
>
> So if I wanted to record the length of each individual document, would it
> be better to store that information with each document, perhaps as an
> unindexed field? Or are there ways to refer to the indexed documents that
> don't change through delete and optimize steps? Thanks.
>
> Mike O'Leary
>
>   _____
>
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Tuesday, February 27, 2007 9:22 AM
> To: java-user@lucene.apache.org; tm-oleary@comcast.net
> Subject: Re: Storing extra data in index
>
>
>
> You can just add a document. I used this technique in an application,
> and it hinges upon realizing that not all documents in an index need
> to have the same fields. So, say your regular documents have
> fields f1, f2, f3...fn. Create a special document with fields
> s1, s2, s3, s4 that contain your meta data. Whenever you add
> more "regular" documents to the index you can modify
> the special document as necessary.
>
> The beauty of this is that as long as the special document
> contains no fields in common with your regular documents,
> you'll never have it returned by searches because the fields
> are disjoint. And searches to find it will be very fast because
> there's only one.
>
> You can take this as far as you like. For instance, you
> could store a field (no need to even index it!) that
> contains, say, an XML version of all the meta-data
> you want to use in your special document. Perhaps
> you want to read this document in at startup and
> store it in a convenient form. Or.....
>
> If you go this route, you may want to consider creating and storing
> the meta-document as a post-build step. I was surprised at how
> quickly I could traverse an index and build up the meta-data
> document after I'd finished with all of the "regular" processing.
>
> One caution, however; I'd be very careful about storing Lucene
> document Ids in my meta-data document since they may change
> if you delete documents and then optimize your index. In fact, they
> WILL change.
>
> BTW, I thoroughly approve of keeping all the parts you can
> in the index, since that's fewer things to keep track of.
>
> Hope this helps
> Erick
>
> On 2/27/07, Mike O'Leary <tm-oleary@comcast.net> wrote:
>
> Is there a standard programming idiom for adding extra data to an index
> that has been created? I am trying to write code to index and search a set
> of documents using the BM25 algorithm, so (as I understand it) I need to
> store the length of each document somewhere and the average document
> length for the collection somewhere (and, I guess, the number of documents
> that have been indexed at any point so I can keep a running average). It
> seems like it would make sense to store these values in the index somehow
> so they are available to the search code. Is there sample code somewhere
> that describes how to do something like this? Or is there a better way
> that I'm not thinking of? Thanks.
>
> Mike O'Leary
>
>
>
>
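
For reference, the collection statistics described in the original
question can be sketched in plain Java as follows. This is not Lucene code
and not anything posted in the thread: Bm25Stats and its methods are
hypothetical names. The sketch assumes lengths are measured in tokens and
keeps the document count and the total length, which is enough to derive
the running average; those two numbers are the sort of thing that could
live in a meta-data document. tfComponent shows where the per-document
length enters the usual BM25 term-frequency factor, with k1 and b as the
standard BM25 tuning parameters (1.2 and 0.75 are commonly used values).

```java
// Hypothetical helper, not part of Lucene: collection statistics for BM25.
final class Bm25Stats {
    private long docCount;     // number of documents indexed so far
    private long totalLength;  // sum of all document lengths, in tokens

    /** Call once per document as it is indexed. */
    void addDocument(int lengthInTokens) {
        if (lengthInTokens < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        docCount++;
        totalLength += lengthInTokens;
    }

    /** Call when a document of the given length is deleted. */
    void removeDocument(int lengthInTokens) {
        if (docCount == 0 || lengthInTokens < 0 || lengthInTokens > totalLength) {
            throw new IllegalStateException("no matching document to remove");
        }
        docCount--;
        totalLength -= lengthInTokens;
    }

    long docCount() {
        return docCount;
    }

    /** Average document length, or 0 for an empty collection. */
    double averageLength() {
        return docCount == 0 ? 0.0 : (double) totalLength / docCount;
    }

    /**
     * Length-normalized term-frequency factor of BM25:
     * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLength / averageLength)).
     * The IDF factor is omitted.
     */
    double tfComponent(int termFreq, int docLength, double k1, double b) {
        double avg = averageLength();
        // With no documents yet, treat every document as average length.
        double relativeLength = avg == 0.0 ? 1.0 : docLength / avg;
        double norm = 1.0 - b + b * relativeLength;
        return termFreq * (k1 + 1.0) / (termFreq + k1 * norm);
    }
}
```

Keeping the total and the count, rather than the average itself, means
adding or deleting a document is a pair of integer updates and the average
never accumulates rounding error.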
