lucene-dev mailing list archives

From "Michael McCandless" <>
Subject Re: Term pollution from binary data
Date Thu, 08 Nov 2007 10:43:53 GMT

I like this approach: it means, at search time, you can choose to
further subsample the already subsampled (during indexing) set of
terms for the TermInfosReader index.  So you can easily turn the
knob to trade off memory usage vs IO cost/latency during searching.
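To make the trade-off concrete, here is a minimal sketch of search-time subsampling. It is not the patch under discussion: the class SubsampledTermIndex and its methods are hypothetical names, and the stored index is modelled as a plain sorted list of strings rather than Lucene's actual TermInfosReader structures. It keeps every indexDivisor-th stored index entry, so memory shrinks by roughly that factor while the worst-case sequential scan grows to termIndexInterval * indexDivisor terms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not Lucene's TermInfosReader.
final class SubsampledTermIndex {
    // Entry j holds the term at position j * effectiveInterval in the full term list.
    private final List<String> indexTerms;
    private final int effectiveInterval;

    // storedIndexTerms: the index as written at indexing time (every
    // termIndexInterval-th term, in sorted order).
    // indexDivisor: keep only every indexDivisor-th stored entry in memory.
    SubsampledTermIndex(List<String> storedIndexTerms, int termIndexInterval, int indexDivisor) {
        if (termIndexInterval < 1 || indexDivisor < 1) {
            throw new IllegalArgumentException("interval and divisor must be >= 1");
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < storedIndexTerms.size(); i += indexDivisor) {
            kept.add(storedIndexTerms.get(i));
        }
        this.indexTerms = kept;
        this.effectiveInterval = termIndexInterval * indexDivisor;
    }

    // Number of index entries actually held in memory.
    int size() {
        return indexTerms.size();
    }

    // Upper bound on the terms scanned sequentially after the index lookup.
    int maxTermsToScan() {
        return effectiveInterval;
    }

    // Position in the full term list at which a sequential scan for `term` starts.
    long scanStart(String term) {
        int pos = Collections.binarySearch(indexTerms, term);
        if (pos < 0) {
            pos = -pos - 2;   // greatest in-memory entry that sorts before `term`
        }
        if (pos < 0) {
            pos = 0;          // `term` sorts before the first index entry
        }
        return (long) pos * effectiveInterval;
    }
}
```

With a stored interval of 128 and a divisor of 2, for example, the reader would hold about half the index entries and scan at most 256 terms per lookup instead of 128.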

I'll open an issue and work through this patch.

One thing: I'd prefer not to use a system property for this, since
it's so global, but I'm not sure how to better do it.

A static int on the class would likewise be global.  Passing down an
argument to the ctor would be good, except, it would have to be
threaded up into SegmentReader, IndexReader, etc., multiplying the
ctors these classes already have.

We can't add a "setIndexDivisor(...)" method because the terms are
already loaded (consuming too much RAM) during the ctor.

This would be the perfect time to use optional named/keyword
arguments, but Java does not support them (grrrr).

What if, instead, we passed down a Properties instance to IndexReader
ctors?  Or alternatively a dedicated class, eg,
"IndexReaderInitParameters"?  The advantage of a dedicated class is
it's strongly typed at compile time, and, you could put things in
there like an optional DeletionPolicy instance as well.  I think there
is a growing list of these sorts of "advanced optional parameters
used during init" that could be handled with such an approach?
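A rough sketch of what the dedicated-class option might look like follows. Everything here is an assumption for illustration: IndexReaderInitParameters is only the name floated above, the DeletionPolicy interface is an empty placeholder, SketchReader stands in for a reader class, and Optional assumes Java 8 or later. The point it shows is that the value is available inside the ctor, before any terms are loaded, without adding one ctor overload per option.

```java
import java.util.Optional;

// Empty placeholder for whatever deletion policy type the reader would accept.
interface DeletionPolicy { }

// Hypothetical holder for optional, advanced init-time settings.
final class IndexReaderInitParameters {
    private int termIndexDivisor = 1;        // 1 = load every stored index term
    private DeletionPolicy deletionPolicy;   // null = use the reader's default

    IndexReaderInitParameters setTermIndexDivisor(int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1");
        }
        this.termIndexDivisor = divisor;
        return this;                         // allow chained calls
    }

    IndexReaderInitParameters setDeletionPolicy(DeletionPolicy policy) {
        this.deletionPolicy = policy;
        return this;
    }

    int getTermIndexDivisor() {
        return termIndexDivisor;
    }

    Optional<DeletionPolicy> getDeletionPolicy() {
        return Optional.ofNullable(deletionPolicy);
    }
}

// Stand-in for a reader: one extra ctor parameter carries all optional settings.
final class SketchReader {
    private final int termIndexDivisor;

    SketchReader(IndexReaderInitParameters params) {
        // The divisor is known here, before the term index would be read.
        this.termIndexDivisor = params.getTermIndexDivisor();
    }

    int termIndexDivisor() {
        return termIndexDivisor;
    }
}
```

A caller would then write something like new SketchReader(new IndexReaderInitParameters().setTermIndexDivisor(4)), and new options could be added to the parameter class later without touching ctor signatures.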

Any other options here?


"Doug Cutting" <> wrote:
> Chuck Williams wrote:
> > It appears that termIndexInterval is factored into the stored index and 
> > thus cannot be changed dynamically to work around the problem after an 
> > index has become polluted.  Other than identifying the documents 
> > containing binary data, deleting them, and then optimizing the whole 
> > index, has anybody found a better way to recover from this problem?
> Hadoop's MapFile is similar to Lucene's term index, and supports a 
> feature where only a subset of the index entries are loaded (determined 
> by  It would not be difficult to add such a feature 
> to Lucene by changing TermInfosReader#ensureIndexIsRead().
> Here's a (totally untested) patch.
> Doug
