lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Term pollution from binary data
Date Wed, 07 Nov 2007 19:26:31 GMT
Chuck Williams wrote:
> It appears that termIndexInterval is factored into the stored index and 
> thus cannot be changed dynamically to work around the problem after an 
> index has become polluted.  Other than identifying the documents 
> containing binary data, deleting them, and then optimizing the whole 
> index, has anybody found a better way to recover from this problem?

Hadoop's MapFile is similar to Lucene's term index, and supports a 
feature where only a subset of the index entries is loaded (determined 
by io.map.index.skip).  Because the subsetting happens when the index is 
read, the stored index does not need to be rewritten.  It would not be 
difficult to add such a feature to Lucene by changing 
TermInfosReader#ensureIndexIsRead().
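
To make the idea concrete, here is a minimal standalone sketch of 
read-time subsampling.  It is not the patch and not Lucene's actual 
TermInfosReader code: the class and method names are hypothetical, the 
term index is modeled as a plain list of strings, and "skip" is assumed 
to mean the number of stored entries skipped between each loaded entry, 
as with io.map.index.skip.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or Hadoop.
final class SubsampledTermIndex {

    // Load only every (skip + 1)-th stored index entry into memory.
    // skip == 0 loads everything, matching the unmodified behavior.
    static List<String> loadSubsampled(List<String> storedIndexTerms, int skip) {
        if (skip < 0) {
            throw new IllegalArgumentException("skip must be >= 0");
        }
        List<String> loaded = new ArrayList<>();
        for (int i = 0; i < storedIndexTerms.size(); i += skip + 1) {
            loaded.add(storedIndexTerms.get(i));
        }
        return loaded;
    }

    // Spacing, in terms, between two loaded entries. A lookup finds the
    // nearest loaded entry at or before the target and then scans forward
    // sequentially, so it may scan up to this many terms.
    static int effectiveInterval(int termIndexInterval, int skip) {
        return termIndexInterval * (skip + 1);
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("a", "c", "e", "g", "i");
        System.out.println(loadSubsampled(stored, 1));
        System.out.println(effectiveInterval(128, 3));
    }
}
```

The trade-off is memory against seek cost: with skip = 3, only a quarter 
of the index entries are held in memory, but each term lookup may have to 
scan about four times as many terms on disk.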

Here's a (totally untested) patch.

Doug
