lucene-dev mailing list archives

From "Chuck Williams (JIRA)" <>
Subject [jira] Updated: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader
Date Mon, 19 Nov 2007 20:26:43 GMT


Chuck Williams updated LUCENE-1052:

    Attachment: termInfosConfigurer.patch

termInfosConfigurer.patch extends the termInfosIndexDivisor mechanism to allow dynamic management
of this parameter.  A new interface, TermInfosConfigurer, allows specification of a method,
getMaxTermsCached(), that bounds the size of the in-memory term infos as a function of the
segment name, segment numDocs, and total segment terms.  This bound is then used to automatically
set termInfosIndexDivisor whenever a TermInfosReader reads the term index.  This mechanism
provides a simple way to ensure that the total amount of memory consumed by the term cache
is bounded by, say, O(log(numDocs)).
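
To make the mechanism above concrete, here is a minimal sketch of how a bound returned by
getMaxTermsCached() could be turned into a divisor.  It is not the code in the attached patch:
the interface signature, the chooseDivisor() helper, and the example policy are assumptions
made for illustration only.

```java
class TermInfosConfigurerSketch {

    /** Hypothetical shape of the interface; the signature in the patch may differ. */
    interface TermInfosConfigurer {
        long getMaxTermsCached(String segmentName, int numDocs, long totalTerms);
    }

    /**
     * Returns the smallest divisor >= 1 that keeps the number of index terms
     * held in memory at or below maxTermsCached (assumed helper, not Lucene API).
     */
    static int chooseDivisor(long totalTerms, int termIndexInterval, long maxTermsCached) {
        if (termIndexInterval <= 0 || maxTermsCached <= 0) {
            throw new IllegalArgumentException("interval and bound must be positive");
        }
        // Index entries written at indexing time: one per termIndexInterval terms.
        long indexedTerms = (totalTerms + termIndexInterval - 1) / termIndexInterval;
        // Ceiling division so that ceil(indexedTerms / divisor) <= maxTermsCached.
        long divisor = (indexedTerms + maxTermsCached - 1) / maxTermsCached;
        return (int) Math.max(1L, Math.min(divisor, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        // Example policy (an assumption): roughly 1000 * log2(numDocs) cached terms.
        TermInfosConfigurer logBound = (name, numDocs, totalTerms) ->
            Math.max(1L, (long) (1000 * (Math.log(Math.max(2, numDocs)) / Math.log(2))));

        long bound = logBound.getMaxTermsCached("_a", 1_000_000, 50_000_000L);
        int divisor = chooseDivisor(50_000_000L, 128, bound);
        System.out.println("bound=" + bound + " divisor=" + divisor);
    }
}
```

With a policy like this, a segment polluted by binary terms gets a larger divisor, while
ordinary segments keep a divisor of 1.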

All Lucene core tests pass.  I'm using another version of this same patch in Lucene 2.1+ in
an application that has indexes with binary term pollution, using the TermInfosConfigurer
to dynamically bound the term cache in the polluted segments.

Tried to test contrib, but it appears gdata-server needs external libraries I don't have to

Michael, this patch applies cleanly to today's Lucene trunk.  I'd appreciate it if you could
verify one thing.  Lucene 2.3 has the incremental reopen mechanism (can't wait to get that!),
new since Lucene 2.1.  It appears that reopen of a segment reuses the same TermInfosReader
and thus does not need to configure a new one.  I've implemented that part of the patch with
this assumption.

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>                 Key: LUCENE-1052
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
> The termIndexInterval, set at indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs. the cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM.  E.g., a setting of 2 means every (2 * termIndexInterval)-th
> term is loaded into RAM.
> This is particularly useful if your index has a great many terms (e.g.,
> you accidentally indexed binary terms).
> Spinoff from this thread:
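
The sub-sampling arithmetic described in the issue summary above can be sketched as follows.
This is only an illustration: the class and method names are hypothetical, and the
termIndexInterval of 128 and the term count are example values, not figures from the issue.

```java
class IndexDivisorSketch {

    /** Number of terms between consecutive index entries kept in RAM. */
    static long effectiveInterval(int termIndexInterval, int indexDivisor) {
        return (long) termIndexInterval * indexDivisor;
    }

    /** Approximate number of index entries a reader loads for one segment. */
    static long loadedIndexTerms(long totalTerms, int termIndexInterval, int indexDivisor) {
        long step = effectiveInterval(termIndexInterval, indexDivisor);
        return (totalTerms + step - 1) / step; // ceiling division
    }

    public static void main(String[] args) {
        long totalTerms = 1_000_000L; // example value
        int interval = 128;           // example value
        for (int divisor = 1; divisor <= 4; divisor *= 2) {
            System.out.println("divisor=" + divisor
                + " loads " + loadedIndexTerms(totalTerms, interval, divisor) + " index terms");
        }
    }
}
```

Doubling the divisor roughly halves the number of index entries held in RAM, at the cost of
a longer scan after each seek into the term dictionary.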

