lucene-java-user mailing list archives

From "Dragon Fly" <>
Subject Re: Empty fields ...
Date Wed, 19 Jul 2006 12:37:41 GMT
My index gets rebuilt every night, so I can probably afford
to construct the filters right after the index is rebuilt. How
do I check each document for empty fields, though? Would
I use an IndexReader to loop through the documents? If so,
which method(s) in the IndexReader class should I use?
termDocs()? Thank you.
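
One way to think about the loop, sketched with plain Java collections rather than the Lucene API: walk every term of the field, mark each document in that term's postings, and treat whatever is left unmarked as "empty". The class name, the map-based toy index, and `maxDoc` as a parameter are all hypothetical stand-ins. In Lucene itself the walk would presumably go through `IndexReader.terms(...)` and `IndexReader.termDocs(...)`, but that mapping is an assumption and is not shown here. Deleted documents are not handled.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Toy stand-in for an inverted index: field -> term -> doc IDs.
// This is NOT the Lucene API; every name here is hypothetical.
class EmptyFieldBits {

    // One bit per document, set when the document has at least one
    // indexed term in the given field.
    static BitSet docsWithField(Map<String, Map<String, int[]>> index,
                                String field, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        Map<String, int[]> terms = index.get(field);
        if (terms == null) {
            return bits; // no document has this field at all
        }
        for (int[] postings : terms.values()) {
            for (int doc : postings) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Complement: documents in [0, maxDoc) with no term in the field.
    static BitSet docsMissingField(Map<String, Map<String, int[]>> index,
                                   String field, int maxDoc) {
        BitSet bits = docsWithField(index, field, maxDoc);
        bits.flip(0, maxDoc);
        return bits;
    }

    public static void main(String[] args) {
        Map<String, Map<String, int[]>> index = new TreeMap<>();
        Map<String, int[]> title = new TreeMap<>();
        title.put("lucene", new int[] {0, 2});
        title.put("filter", new int[] {2, 3});
        index.put("title", title);

        // Five documents (IDs 0..4); 1 and 4 have no "title" terms.
        System.out.println(docsMissingField(index, "title", 5));
    }
}
```

Note that "empty" here means "no indexed terms in the field", which may or may not match what the application considers empty.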

>From: "Erick Erickson" <>
>Subject: Re: Empty fields ...
>Date: Tue, 18 Jul 2006 13:08:53 -0400
>Quoting the guys "it depends" <G>...
>At root, a filter is a bitset. So size-wise, you are using 1 bit/doc (plus
>some small overhead). Both the storage required and the time to construct
>are dependent on the characteristics of your corpus. I guess the only way
>you can answer that for your particular situation is to test with your
>corpus. I can say that I was surprised at how very fast constructing a
>filter was in my situation. Which has no relevance to your situation.
>More of "it depends" is the fluidity of your index. If you construct it
>and don't modify it, you could consider storing your filters permanently,
>either in files or as "special documents" in your index or perhaps even in
>a meta-data index. You can store documents of meta-data just by putting in
>fields that are in none of your other documents. Deletions, additions and
>re-optimizations will affect the internal Lucene doc IDs, so you've got to
>be careful here about synchronization...
>You could consider constructing your filters all in a bunch when you open
>your searcher. Again, whether you modify your searcher often will
>determine whether you want to do this or not.
>What I'd really recommend is that you start by constructing your filters on
>the fly, without even a caching wrapper, and get some timings, mostly for
>your peace of mind. I'd also do some timings when combining filters, just
>for yucks. There's no reason not to use a caching wrapper if you expect to
>reuse these filters. It will load the first user with a delay, but you can
>warm up your filters by issuing some canned queries upon startup...
>Only if constructing any filters on the fly and using a caching wrapper
>proves unsatisfactory would I move on to any kind of permanent storage.
>Premature optimization and all that....
>So, I don't have a good answer since I don't have a detailed knowledge of
>your problem, but it should be relatively easy for you to get a sense of
>whether this is a reasonable approach or not.
>Hope this helps
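
As a rough illustration of the "filter is a bitset" and "caching wrapper" points quoted above, here is a minimal sketch in plain Java. `CachedBits`, its `builds` counter, and the use of a bare `Object` as the "reader" are hypothetical and are not Lucene classes. The idea is only that bits are built once per reader instance and that combining two filters is a bitwise AND. Because the cache is keyed by reader, a reader opened on a rebuilt index gets fresh bits instead of stale doc IDs.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Hypothetical sketch; none of these types are Lucene classes.
class CachedBits<R> {
    private final Function<R, BitSet> builder;
    // Weak keys: an entry can be collected once its reader is unreachable.
    private final Map<R, BitSet> cache = new WeakHashMap<>();
    int builds = 0; // how many times the expensive build actually ran

    CachedBits(Function<R, BitSet> builder) {
        this.builder = builder;
    }

    // Returns the bits for this reader, building them on first use only.
    synchronized BitSet bits(R reader) {
        BitSet cached = cache.get(reader);
        if (cached == null) {
            cached = builder.apply(reader);
            builds++;
            cache.put(reader, cached);
        }
        return (BitSet) cached.clone(); // callers may mutate their copy
    }

    // Combining two filters: keep only documents present in both.
    static BitSet and(BitSet a, BitSet b) {
        BitSet out = (BitSet) a.clone();
        out.and(b);
        return out;
    }

    public static void main(String[] args) {
        CachedBits<Object> hasTitle = new CachedBits<>(reader -> {
            BitSet b = new BitSet(5);
            b.set(0);
            b.set(2);
            b.set(3);
            return b;
        });
        Object reader = new Object(); // stands in for an open index reader
        BitSet first = hasTitle.bits(reader);  // built here
        BitSet second = hasTitle.bits(reader); // served from the cache
        System.out.println(first.equals(second) + " builds=" + hasTitle.builds);
    }
}
```

Timing the builder with and without the cache, as suggested in the quoted message, would show whether the wrapper is worth having for a given corpus.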
