lucene-dev mailing list archives

From Doug Cutting
Subject Re: Term pollution from binary data
Date Thu, 08 Nov 2007 18:04:50 GMT
Michael McCandless wrote:
> One thing is: I'd prefer not to use a system property for this, since
> it's so global, but I'm not sure how to do it better.

I agree.  That was the quick-and-dirty hack.  Ideally it should be a 
method on IndexReader.  I can think of two ways to do that:

1. Add a generic method like IndexReader#setProperty(String,String).
2. Add a specific method like IndexReader#setTermIndexDivisor(int).

I slightly prefer the former, as it permits various IndexReader 
implementations to support arbitrary properties, at the expense of being 
untyped, but that might be overkill.  Thoughts?
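
A minimal sketch of the trade-off between the two options. SketchReader is a hypothetical stand-in, not Lucene's IndexReader, and the "termIndexDivisor" key is made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a reader; not Lucene's actual IndexReader.
class SketchReader {
    private final Map<String, String> properties = new HashMap<>();
    private int termIndexDivisor = 1;

    // Option 1: generic and untyped. Any implementation can define its own
    // keys, but a misspelled key or a bad value is only caught at run time.
    void setProperty(String name, String value) {
        properties.put(name, value);
    }

    String getProperty(String name) {
        return properties.get(name);
    }

    // Option 2: specific and typed. The compiler checks the argument,
    // but every new option needs a new method on the reader API.
    void setTermIndexDivisor(int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1");
        }
        termIndexDivisor = divisor;
    }

    int getTermIndexDivisor() {
        return termIndexDivisor;
    }
}

class PropertyStyleDemo {
    public static void main(String[] args) {
        SketchReader reader = new SketchReader();
        reader.setProperty("termIndexDivisor", "4"); // key name is an assumption
        reader.setTermIndexDivisor(4);
        System.out.println(reader.getProperty("termIndexDivisor"));
        System.out.println(reader.getTermIndexDivisor());
    }
}
```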

> We can't add a "setIndexDivisor(...)" method because the terms are
> already loading (consuming too much ram) during the ctor.

Aren't indexes loaded lazily?  That's an important optimization for 
merging, no?  For performance reasons, opening an IndexReader shouldn't 
do much more than open files.  However, if we build a more generic 
mechanism, we should not rely on that.
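
To illustrate why lazy loading matters here: if the term index is only read on first use, a setter called after construction can still take effect. The sketch below assumes a divisor of N means keeping every Nth index entry. LazyTermIndexReader and its methods are hypothetical and do not reflect how SegmentReader is actually written:

```java
// Hypothetical reader that defers loading its term index until first use.
class LazyTermIndexReader {
    private final int termCount;   // number of entries in the on-disk term index
    private int termIndexDivisor = 1;
    private int[] termIndex;       // stays null until first needed

    LazyTermIndexReader(int termCount) {
        // The constructor only records metadata; nothing is loaded yet.
        this.termCount = termCount;
    }

    void setTermIndexDivisor(int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1");
        }
        if (termIndex != null) {
            throw new IllegalStateException("term index already loaded");
        }
        termIndexDivisor = divisor;
    }

    boolean isTermIndexLoaded() {
        return termIndex != null;
    }

    // Loads every divisor-th entry on the first call; later calls reuse it.
    int indexedTermCount() {
        if (termIndex == null) {
            int n = (termCount + termIndexDivisor - 1) / termIndexDivisor;
            termIndex = new int[n];
            for (int i = 0; i < n; i++) {
                termIndex[i] = i * termIndexDivisor; // position of the kept entry
            }
        }
        return termIndex.length;
    }
}
```

With this shape, the setter works only until the first lookup, which is why a more generic mechanism should not depend on the loading being lazy.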

> What if, instead, we passed down a Properties instance to IndexReader
> ctors?  Or alternatively a dedicated class, eg,
> "IndexReaderInitParameters"?  The advantage of a dedicated class is
> it's strongly typed at compile time, and, you could put things in
> there like an optional DeletionPolicy instance as well.  I think there
> is a growing list of these sorts of "advanced optional parameters
> used during init" that could be handled with such an approach?

(I probably should have read your entire message before starting to 
respond...  But it's nice to see that we think alike!)  This is similar 
to my (1) approach, but attempts to solve the typing issue, although I'm 
not sure how...

The way we handle it in Hadoop is to pass around a <String,String> map 
in the abstract kernel, then have concrete implementation classes 
provide static methods that access it.  So this might look something like:

public class LuceneProperties extends Properties {
   // utility methods to handle conversion of values to and from Strings
   void setInt(String prop, int value);
   int getInt(String prop);
   void setClass(String prop, Class value);
   Class getClass(String prop);
   Object newInstance(String prop);
}

public class SegmentReaderProperties {
   private static final String DIVISOR_PROP = ...;
   public static void setTermIndexDivisor(LuceneProperties props, int i) {
     props.setInt(DIVISOR_PROP, i);
   }
}

Then the IndexReader constructor methods could accept a 
LuceneProperties.  No point in making this IndexReader specific, since 
it might be useful for, e.g., IndexWriter, Searchers, Directories, etc.
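
A fuller sketch of how this could fit together, under stated assumptions: the method bodies, the getTermIndexDivisor accessor, the "example.termIndexDivisor" key, and SketchIndexReader are all hypothetical and only illustrate the pattern of a string-backed property bag with typed static accessors:

```java
import java.util.Properties;

// String-backed property bag; the method bodies here are assumptions.
class LuceneProperties extends Properties {
    void setInt(String prop, int value) {
        setProperty(prop, Integer.toString(value));
    }

    int getInt(String prop) {
        return Integer.parseInt(getProperty(prop));
    }

    void setClass(String prop, Class<?> value) {
        setProperty(prop, value.getName());
    }

    Class<?> getClass(String prop) throws ClassNotFoundException {
        return Class.forName(getProperty(prop));
    }

    Object newInstance(String prop) throws ReflectiveOperationException {
        return getClass(prop).getDeclaredConstructor().newInstance();
    }
}

// Typed static accessors keep the key name and the conversion in one place.
class SegmentReaderProperties {
    // Hypothetical key; the message does not give the actual value.
    private static final String DIVISOR_PROP = "example.termIndexDivisor";

    static void setTermIndexDivisor(LuceneProperties props, int i) {
        props.setInt(DIVISOR_PROP, i);
    }

    static int getTermIndexDivisor(LuceneProperties props) {
        // Fall back to 1 (no subsampling) when the property is unset.
        return props.containsKey(DIVISOR_PROP) ? props.getInt(DIVISOR_PROP) : 1;
    }
}

// Hypothetical reader whose constructor accepts the property bag.
class SketchIndexReader {
    private final int termIndexDivisor;

    SketchIndexReader(LuceneProperties props) {
        termIndexDivisor = SegmentReaderProperties.getTermIndexDivisor(props);
    }

    int termIndexDivisor() {
        return termIndexDivisor;
    }
}
```

Callers get compile-time checking through the static methods, while the kernel only ever sees strings.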

An advantage of a <String,String> map over a <String,Object> map for 
Hadoop is that it's trivial to serialize.
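
As an illustration of the serialization point, a string-only map can be round-tripped with java.util.Properties' own store and load methods, with no custom per-type encoding. PropertiesRoundTrip is a hypothetical helper, not part of Lucene or Hadoop:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper showing a text round trip of string-valued properties.
class PropertiesRoundTrip {
    static String serialize(Properties props) {
        StringWriter out = new StringWriter();
        try {
            props.store(out, null); // null means no leading comment line
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    static Properties deserialize(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

A <String,Object> map would instead need every value type to define its own wire format.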

Is this what you had in mind?

