lucene-dev mailing list archives

From "Chuck Williams (JIRA)"
Subject [jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader
Date Sun, 18 Nov 2007 17:50:43 GMT


Chuck Williams commented on LUCENE-1052:

I believe this needs to be a formula, because a reasonable bound on the number of terms is in
general a function of the number of documents in the segment and the nature of the index (e.g.,
types of fields).  A common policy would be to require that RAM usage for cached terms grow
no faster than logarithmically in the number of documents.  The appropriate formula
will depend on the index, i.e. on the application.  It might be of the form:  c*ln(numdocs+k),
where c and k are constants dependent on the index.
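
As a rough illustration of the c*ln(numdocs+k) bound, here is a minimal Java sketch.  It is
not Lucene code: the class name LogTermBound, the method name, the rounding, the floor of one
term, and the restriction k >= 1 (which keeps the logarithm non-negative for an empty segment)
are all assumptions made for the example.

```java
// Illustrative sketch only; LogTermBound is a hypothetical helper, not a Lucene class.
final class LogTermBound {
    private final double c; // scale constant, index dependent
    private final double k; // offset constant, index dependent

    LogTermBound(double c, double k) {
        // Assumption: c > 0 and k >= 1 so the bound is never negative.
        if (c <= 0 || k < 1) {
            throw new IllegalArgumentException("require c > 0 and k >= 1");
        }
        this.c = c;
        this.k = k;
    }

    /** Max cached terms for a segment: floor(c * ln(numDocs + k)), but at least 1. */
    long maxCachedTerms(int numDocs) {
        if (numDocs < 0) {
            throw new IllegalArgumentException("numDocs must be >= 0");
        }
        long bound = (long) Math.floor(c * Math.log(numDocs + k));
        return Math.max(1L, bound);
    }

    public static void main(String[] args) {
        // The constants here are arbitrary; real values would be tuned per index.
        LogTermBound bound = new LogTermBound(1000.0, 1.0);
        for (int numDocs : new int[] {1_000, 1_000_000, 100_000_000}) {
            System.out.println(numDocs + " docs -> " + bound.maxCachedTerms(numDocs) + " terms");
        }
    }
}
```

The point of the shape is that multiplying the document count by 100 adds only a fixed number
of cached terms rather than multiplying it.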

One consequence of this approach, or any approach along these lines, is that the indexDivisor
will vary across the segments, both in a single index and across indexes.  It seems to me
from the code that this should work fine.
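
To make the per-segment variation concrete, the sketch below derives a divisor for one segment
from a cap on cached terms.  It assumes that a divisor d loads roughly every d-th indexed term,
as the issue description states; the class SegmentDivisor and its method are hypothetical names,
not existing Lucene API.

```java
// Illustrative sketch only; SegmentDivisor is a hypothetical helper, not a Lucene class.
final class SegmentDivisor {
    private SegmentDivisor() {}

    /**
     * Smallest divisor d >= 1 such that loading every d-th indexed term
     * keeps at most maxCachedTerms terms in RAM.
     */
    static int divisorFor(long indexedTerms, long maxCachedTerms) {
        if (indexedTerms < 0 || maxCachedTerms < 1) {
            throw new IllegalArgumentException("require indexedTerms >= 0 and maxCachedTerms >= 1");
        }
        // Ceiling division written to avoid overflow on very large inputs.
        long d = indexedTerms / maxCachedTerms + (indexedTerms % maxCachedTerms == 0 ? 0 : 1);
        return (int) Math.min((long) Integer.MAX_VALUE, Math.max(1L, d));
    }

    public static void main(String[] args) {
        // Two segments with the same cap but different term counts get different divisors.
        System.out.println(divisorFor(800, 1000));
        System.out.println(divisorFor(5000, 1000));
    }
}
```

A small segment whose indexed terms already fit under the cap keeps a divisor of 1, while a
large one gets a larger divisor, which is why the value differs from segment to segment.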

This leaves the issue of how to best specify an arbitrary formula.  This requires a method
to compute the max cached terms allowed for a segment based on the number of docs in the segment,
the number of terms in the segment's index, and possibly other factors.  The most direct way
to do this is to introduce an interface, e.g. TermInfosConfigurer, to define the method signature,
and to add setTermInfosConfigurer as an alternative to setTermInfosIndexDivisor.  The new
setter would need to be available in all the same places as setTermInfosIndexDivisor.
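
One possible shape for such an interface is sketched below.  Only the name TermInfosConfigurer
comes from this comment; the method name, its parameters, and the two factory methods are
assumptions for illustration, and the sketch uses newer Java syntax than the trunk currently
targets.

```java
// Illustrative sketch only; the signature below is an assumption, not an agreed Lucene API.
interface TermInfosConfigurer {

    /**
     * Max number of index terms to cache for a segment.
     *
     * @param numDocs         number of documents in the segment
     * @param numIndexedTerms number of terms in the segment's term index
     */
    long maxCachedTerms(int numDocs, long numIndexedTerms);

    /** No limit: cache every indexed term. */
    static TermInfosConfigurer unbounded() {
        return (numDocs, numIndexedTerms) -> numIndexedTerms;
    }

    /** Logarithmic limit c * ln(numDocs + k), never more than the terms available. */
    static TermInfosConfigurer logarithmic(double c, double k) {
        if (c <= 0 || k < 1) {
            throw new IllegalArgumentException("require c > 0 and k >= 1");
        }
        return (numDocs, numIndexedTerms) -> {
            long bound = Math.max(1L, (long) Math.floor(c * Math.log(numDocs + k)));
            return Math.min(numIndexedTerms, bound);
        };
    }
}
```

A reader would call maxCachedTerms once per segment when opening its term index, so constant
and formula-based policies go through the same hook.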

A more general approach would be to introduce an IndexConfigurer class which over time could
hold additional methods like this.  It could even replace the current setters on IndexReader
(as well as IndexWriter, etc.) with a more general mechanism that would allow dynamic parameters
used to configure any classes in the index structure.  Each constructor would be passed the
IndexConfigurer and call getters or other methods on it to obtain its config.  The methods
could provide constant values or dynamic formulas.

I'm going to implement the straightforward solution at the moment in our older version of
Lucene, then will sync up to whatever you guys decide is best for the trunk later.

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>                 Key: LUCENE-1052
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>         Attachments: LUCENE-1052.patch
> The termIndexInterval, set at indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs the cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM.  E.g., a setting of 2 means every (2 * termIndexInterval)-th term is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
