lucene-dev mailing list archives

From "Doug Cutting (JIRA)" <>
Subject [jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader
Date Wed, 21 Nov 2007 17:54:43 GMT


Doug Cutting commented on LUCENE-1052:

> We find a surprising number of them contain embedded encoded binary data.

It sounds like a detector for this would be very useful.  Unlike a divisor, it would
substantially speed updates of such indexes without slowing searches of them.  At Excite we
evolved effective heuristics for wordness to keep our dictionaries from exploding.  Perhaps
you should look into that?  Also, you might increase your default term index interval, since
you seem to have big indexes with noisy data.

> Our users won't accept a solution like, wait until the problem occurs and then increment
> your termIndexDivisor. They expect our app to manage this automatically.

You could look at the size of the .tii files before you open an index, and, if they're too
large, set the divisor automatically as you see fit.
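
A minimal sketch of that idea, assuming the index is a plain directory whose .tii files are
visible on disk (with the compound file format they are packed inside .cfs files, so this scan
would not see them).  The class name, the helper names, and the 64 MB budget are hypothetical,
and on-disk size is used only as a rough proxy for the RAM a reader would need:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TiiDivisorChooser {

    // Hypothetical budget for the term index; not a Lucene constant.
    static final long MAX_TII_BYTES = 64L * 1024 * 1024;

    /** Sums the sizes of all *.tii files directly inside the index directory. */
    static long totalTiiBytes(Path indexDir) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir, "*.tii")) {
            for (Path f : files) {
                total += Files.size(f);
            }
        }
        return total;
    }

    /** Smallest divisor >= 1 that brings tiiBytes / divisor within maxBytes. */
    static int chooseDivisor(long tiiBytes, long maxBytes) {
        if (tiiBytes <= maxBytes) {
            return 1; // small enough: load the full term index
        }
        long divisor = (tiiBytes + maxBytes - 1) / maxBytes; // ceiling division
        return (int) Math.min(divisor, Integer.MAX_VALUE);
    }

    public static void main(String[] args) throws IOException {
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        long bytes = totalTiiBytes(indexDir);
        System.out.println("tii bytes=" + bytes
                + " divisor=" + chooseDivisor(bytes, MAX_TII_BYTES));
    }
}
```

The chosen value would then be handed to whatever divisor setter the reader exposes before
the index is searched.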

> int bound = (int) (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);

This sounds like a fine approach.
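
To see how the quoted bound scales, here is a small sketch that evaluates it for a few segment
sizes.  The value of TERM_BOUNDING_MULTIPLIER is not given in this message, so 100000 is purely
an assumption; 128 is Lucene's default termIndexInterval.  The sketch uses 1.0 rather than 1
inside the square root only to avoid int overflow for very large segments:

```java
class TermBound {

    // Assumed value for illustration; the real multiplier is not stated in the thread.
    static final double TERM_BOUNDING_MULTIPLIER = 100000.0;

    // Lucene's default termIndexInterval.
    static final int TERM_INDEX_INTERVAL = 128;

    /** The bound from the quoted line: grows with the square root of the doc count. */
    static int bound(int segmentNumDocs) {
        return (int) (1 + TERM_BOUNDING_MULTIPLIER
                * Math.sqrt(1.0 + segmentNumDocs) / TERM_INDEX_INTERVAL);
    }

    public static void main(String[] args) {
        int[] sizes = {0, 9999, 999999};
        for (int numDocs : sizes) {
            System.out.println(numDocs + " docs -> bound " + bound(numDocs));
        }
    }
}
```

Because of the square root, a segment with 100 times as many documents gets a bound only about
10 times larger.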

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>                 Key: LUCENE-1052
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
> The termIndexInterval, set at indexing time, lets you trade off
> how much RAM a reader uses to load the indexed terms against the cost of
> seeking to the specific term you want to load.
> But the downside is that you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you can further sub-sample the termIndexInterval to use
> less RAM.  E.g., a setting of 2 means every (2 * termIndexInterval)-th term is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (e.g.,
> you accidentally indexed binary terms).
> Spinoff from this thread:
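
The sub-sampling described in the issue can be illustrated with a toy model.  This is not
TermInfosReader's code; the class and method names are hypothetical, and terms are represented
only by their positions in the sorted term list:

```java
import java.util.ArrayList;
import java.util.List;

class IndexDivisorSketch {

    /** Positions of the terms held in RAM: every (interval * divisor)-th term, from 0. */
    static List<Integer> loadedPositions(int numTerms, int termIndexInterval, int indexDivisor) {
        List<Integer> positions = new ArrayList<>();
        int step = termIndexInterval * indexDivisor;
        for (int i = 0; i < numTerms; i += step) {
            positions.add(i);
        }
        return positions;
    }

    public static void main(String[] args) {
        // With 1000 terms and interval 128, doubling the divisor halves the loaded entries.
        System.out.println(loadedPositions(1000, 128, 1).size());
        System.out.println(loadedPositions(1000, 128, 2).size());
    }
}
```

The trade-off is that a lookup must then scan up to interval * divisor terms on disk after
seeking to the nearest loaded entry, which is why a divisor slows searches.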
