lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Updated: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader
Date Wed, 21 Nov 2007 20:24:43 GMT


Michael McCandless updated LUCENE-1052:

    Attachment: LUCENE-1052.patch

What class would we put TermInfosReader-specific setters & getters on, since that class
is not public? Do we make TermInfosReader public or leave it package-private? My intuition
is to leave it package-private for now, in order to retain freedom to re-structure w/o breaking
applications, and because making it public would drag a lot of other stuff into the public.
We could consider making SegmentReader public, so that there's a public class that corresponds
to the concrete index implementation, but that'd also drag more stuff public (like DirectoryIndexReader).

Agreed: package private.  People who do advanced things should be fine
with that.

Another option is to make a public class whose purpose is just to hold such parameters, something
like SegmentIndexParameters. That'd be my first choice and was the direction I pointed in
my initial proposal, but with considerably less explanation.

So I took a closer look at making generic properties by coding up
Doug's approach (attached patch).

I replaced *#setTermInfosIndexDivisor with a separate
SegmentIndexProperties class that has static methods to set/get
termIndexDivisor, and added/threaded down ctors that allow you to pass
a LuceneProperties when opening an IndexReader.
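
For illustration only, here is a minimal sketch of what such a static facade over a generic property bag might look like. It is not taken from the attached patch: the `set`/`get` methods on `LuceneProperties`, the key string, and the validation are all assumptions. The default of 1 is placed on the facade purely for the sake of the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical generic property bag; the class in the actual patch may differ.
class LuceneProperties {
    private final Map<String, Object> values = new HashMap<String, Object>();

    void set(String key, Object value) {
        values.put(key, value);
    }

    // Returns the stored value, or defaultValue when the key was never set.
    Object get(String key, Object defaultValue) {
        Object v = values.get(key);
        return v != null ? v : defaultValue;
    }
}

// Static facade giving typed access to segment-level properties.
final class SegmentIndexProperties {
    // Assumed key name, for illustration only.
    static final String TERM_INDEX_DIVISOR = "segment.termIndexDivisor";
    static final int DEFAULT_TERM_INDEX_DIVISOR = 1;

    private SegmentIndexProperties() {}

    static void setTermIndexDivisor(LuceneProperties props, int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1: " + divisor);
        }
        props.set(TERM_INDEX_DIVISOR, Integer.valueOf(divisor));
    }

    static int getTermIndexDivisor(LuceneProperties props) {
        Object v = props.get(TERM_INDEX_DIVISOR, Integer.valueOf(DEFAULT_TERM_INDEX_DIVISOR));
        return ((Integer) v).intValue();
    }
}
```

A caller would build a `LuceneProperties`, call `SegmentIndexProperties.setTermIndexDivisor(props, 4)`, and hand `props` to whatever opens the reader.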

I came up with a number of questions along the way:

  * Who should know/store the default value for a given property?
    TermIndexDivisor defaults to 1.
    Is this stored in that static facade class (a)?  Or, passed in as
    defaultValue arg by TermInfosReader when it looks up the property
    (b)?  Or, do we make a base DefaultLuceneProperties that has the
    default set for all properties (c)?
    (b) is nice because I feel like the default should live in the
    class that uses it, but then that's bad because the outside world
    can't see the default value.

  * Every property must clearly define when it will be looked at.  So
    for termIndexDivisor in the javadoc we would say "it's used only
    when the termInfos index is loaded (once)".  This means changing
    that property after termInfos index is loaded has no effect.

  * We should presumably create a default LuceneProperties to save
    checking for props != null everywhere when user didn't make their
    own props.  This favors option (c) in the first bullet above.

  * Presumably once you've created a class, passing in your props
    instance, you cannot later install a new props instance.  The
    LuceneProperties class is "write once".

  * We would need guidelines for when something should be an arg to
    the ctor versus a setter/getter on the class.  I think there are
    shades of gray here.
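
To make option (c) and the "write once" idea concrete, here is a rough sketch under stated assumptions. The names `DefaultedProps` and `TermIndexHolder` are hypothetical and do not come from the patch. The sketch shows a shared default instance that removes the null checks, and a consumer that fixes its props at construction time.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable property set with built-in defaults (option (c)).
final class DefaultedProps {
    private static final Map<String, Integer> DEFAULTS = createDefaults();

    // Shared instance used when the caller supplies no props of their own.
    static final DefaultedProps DEFAULT =
        new DefaultedProps(Collections.<String, Integer>emptyMap());

    private final Map<String, Integer> values;

    DefaultedProps(Map<String, Integer> overrides) {
        // Copy defaults, then overrides, so later changes to the caller's map are ignored.
        Map<String, Integer> merged = new HashMap<String, Integer>(DEFAULTS);
        merged.putAll(overrides);
        this.values = Collections.unmodifiableMap(merged);
    }

    private static Map<String, Integer> createDefaults() {
        Map<String, Integer> d = new HashMap<String, Integer>();
        d.put("termIndexDivisor", Integer.valueOf(1)); // assumed key name
        return d;
    }

    int getInt(String key) {
        Integer v = values.get(key);
        if (v == null) {
            throw new IllegalArgumentException("unknown property: " + key);
        }
        return v.intValue();
    }
}

// Stand-in for a class such as a reader that receives props once, in its ctor.
final class TermIndexHolder {
    private final DefaultedProps props; // final: no new props instance can be installed later

    TermIndexHolder(DefaultedProps props) {
        this.props = props != null ? props : DefaultedProps.DEFAULT;
    }

    int divisor() {
        return props.getInt("termIndexDivisor");
    }
}
```

The trade-off from the first bullet still applies: with this layout the default is visible to the outside world, but it no longer lives in the class that uses it.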

After this, I suddenly realized that if we indeed make termIndexDivisor
a generic property, it's actually hard for Chuck to then do his formula
by looking at the size of the .tii file: when the index has multiple
segments, you would presumably need to set different indexDivisors for
each segment, but the properties only let you set one global value.

You could carefully set the property, then somehow get ahold of just
that one SegmentReader and have it load the term index, then move on to
the next one, etc, but that's quite messy.

Note that this limitation is also the case with the top-level
setTermInfosIndexDivisor as it now stands in trunk -- it's not easy to
set different index divisors per segment.

It almost feels like we should have "hooks" that are invoked at
certain times, like when we are about to load the term infos index,
that give the application a chance to change something...
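
As a rough illustration of the hook idea, the sketch below defines a callback that would be consulted once per segment just before its term infos index is loaded. Everything here is hypothetical: the `TermIndexLoadHook` interface does not exist in Lucene, and the size-based rule is an invented placeholder, not Chuck's actual formula.

```java
// Hypothetical callback invoked once per segment before loading its .tii index.
interface TermIndexLoadHook {
    // Returns the index divisor to use for this segment (must be >= 1).
    int chooseIndexDivisor(String segmentName, long tiiFileLength);
}

// Example policy: pick the smallest divisor that keeps the loaded portion of
// each segment's .tii file under a byte budget. The rule is a placeholder.
final class SizeBasedDivisorHook implements TermIndexLoadHook {
    private final long maxBytesPerSegment;

    SizeBasedDivisorHook(long maxBytesPerSegment) {
        if (maxBytesPerSegment < 1) {
            throw new IllegalArgumentException("budget must be >= 1");
        }
        this.maxBytesPerSegment = maxBytesPerSegment;
    }

    public int chooseIndexDivisor(String segmentName, long tiiFileLength) {
        if (tiiFileLength <= maxBytesPerSegment) {
            return 1; // small segment: load every indexed term
        }
        // Ceiling division: smallest divisor d with tiiFileLength / d <= budget.
        return (int) ((tiiFileLength + maxBytesPerSegment - 1) / maxBytesPerSegment);
    }
}
```

Because the hook sees each segment separately, it sidesteps the single-global-value limitation described above, at the cost of adding a new extension point.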

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>                 Key: LUCENE-1052
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>         Attachments: LUCENE-1052.patch, LUCENE-1052.patch, termInfosConfigurer.patch
> The termIndexInterval, set at indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs the cost
> of seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
