lucene-dev mailing list archives

From Chuck Williams
Subject Re: Term pollution from binary data
Date Tue, 13 Nov 2007 01:33:26 GMT
Doug Cutting wrote on 11/07/2007 09:26 AM:
> Hadoop's MapFile is similar to Lucene's term index, and supports a 
> feature where only a subset of the index entries are loaded 
> (determined by  It would not be difficult to add 
> such a feature to Lucene by changing TermInfosReader#ensureIndexIsRead().
> Here's a (totally untested) patch.

Doug, thanks for this suggestion and your quick patch.

I fleshed this out in the version of Lucene we are using, a bit after 
2.1.  There was an off-by-1 bug plus a few missing pieces.  The attached 
patch is for 2.1+, but might be useful as it at least contains the 
corrections and missing elements.  It also contains extensions to the 
tests to exercise the patch.
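
For readers without the attachment, here is a rough sketch of the general idea Doug describes: keep only every Nth entry of a sorted term index in memory, then find the nearest kept entry at or before a target term. This is not the patch itself. The class name SubsampledTermIndex, the divisor parameter, and the floorPosition method are all hypothetical names made up for illustration, and plain strings stand in for Lucene terms.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a sorted index that keeps every Nth entry in memory. */
public class SubsampledTermIndex {
    private final String[] terms;   // kept entries, still in sorted order
    private final int[] positions;  // position of each kept entry in the full index

    private SubsampledTermIndex(String[] terms, int[] positions) {
        this.terms = terms;
        this.positions = positions;
    }

    /** Keeps entries 0, divisor, 2*divisor, ... of an already sorted array. */
    public static SubsampledTermIndex load(String[] sortedEntries, int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1");
        }
        List<String> kept = new ArrayList<>();
        List<Integer> keptPositions = new ArrayList<>();
        for (int i = 0; i < sortedEntries.length; i += divisor) {
            kept.add(sortedEntries[i]);
            keptPositions.add(i);
        }
        int[] pos = new int[keptPositions.size()];
        for (int i = 0; i < pos.length; i++) {
            pos[i] = keptPositions.get(i);
        }
        return new SubsampledTermIndex(kept.toArray(new String[0]), pos);
    }

    /** Number of entries actually held in memory. */
    public int size() {
        return terms.length;
    }

    /**
     * Returns the full-index position of the greatest kept entry that is
     * less than or equal to target, or -1 if target sorts before all entries.
     * A real reader would scan forward from that position to find the term.
     */
    public int floorPosition(String target) {
        int lo = 0;
        int hi = terms.length - 1;
        int found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (terms[mid].compareTo(target) <= 0) {
                found = mid;   // candidate; look for a later one
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found < 0 ? -1 : positions[found];
    }
}
```

The trade-off in this sketch is the usual one: a larger divisor holds fewer entries in memory, at the cost of a longer forward scan from the position that floorPosition returns.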

I tried integrating this into 2.3, but enough has changed so that it was 
not straightforward (primarily for the test case extensions -- the 
implementation seems like it will apply with just a bit of manual merging).  
Unfortunately, I have so many local changes that it has become difficult 
to track the latest Lucene.  The task of syncing up will come soon.  
I'll post a proper patch against the trunk in jira at a future date if 
the issue is not already resolved before then.

Michael McCandless wrote on 11/08/2007 12:43 AM:
> I'll open an issue and work through this patch.
Michael, I did not see the issue, or I would have posted this there.  
Unfortunately, I'm pretty far behind on lucene mail these days.
> One thing is: I'd prefer to not use system property for this, since
> it's so global, but I'm not sure how to better do it.

Agree strongly that this should not be global.  Whether ctors or an 
index-specific properties object or whatever, it is important to be able 
to set this on some indexes and not others in a single application.
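
As a sketch of what a per-index setting could look like, the following contrasts it with a JVM-wide system property. None of these names come from Lucene: PerIndexDivisorDemo, Reader, and the "example.index.divisor" property key are hypothetical and only illustrate the shape of the ctor option mentioned above.

```java
/** Illustrative only: per-instance configuration instead of a global property. */
public class PerIndexDivisorDemo {

    /** Hypothetical stand-in for an index reader; not a Lucene class. */
    public static final class Reader {
        private final String path;
        private final int indexDivisor;

        /** The divisor is passed per instance, so each index can differ. */
        public Reader(String path, int indexDivisor) {
            if (indexDivisor < 1) {
                throw new IllegalArgumentException("indexDivisor must be >= 1");
            }
            this.path = path;
            this.indexDivisor = indexDivisor;
        }

        public String getPath() {
            return path;
        }

        public int getIndexDivisor() {
            return indexDivisor;
        }
    }

    /**
     * The global alternative: one value for every index in the JVM.
     * The property key is made up for this example.
     */
    public static int globalDivisor() {
        return Integer.getInteger("example.index.divisor", 1);
    }

    public static void main(String[] args) {
        // Two indexes in one application with different settings.
        Reader small = new Reader("/indexes/small", 1);
        Reader huge = new Reader("/indexes/huge", 8);
        System.out.println(small.getPath() + " divisor=" + small.getIndexDivisor());
        System.out.println(huge.getPath() + " divisor=" + huge.getIndexDivisor());
    }
}
```

With the ctor form, an index polluted by binary terms can use a large divisor while a small, clean index in the same application keeps the full term index.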

Thanks for picking this up!

