lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Updated: (LUCENE-2391) Spellchecker uses default IW mergefactor/ramMB settings of 300/10
Date Wed, 22 Dec 2010 17:56:06 GMT


Robert Muir updated LUCENE-2391:

    Attachment: LUCENE-2391.patch

Here's a patch to speed up the spellchecker build.

* I wired the default RamMB to IWConfig's default.
* I didn't mess with the mergefactor for now (because the default is still to optimize),
* but I added an additional 'optimize' parameter so you can update your spellcheck index without optimizing.
* When updating, I changed exists() to work per-segment, so it's reasonable if the index isn't optimized.
* The exists() check now bypasses the term dictionary cache, because using the cache there is stupid and just slows it down.
* We don't do any of the exists() logic if the index is empty (I think this is the case for Solr, which completely rebuilds and doesn't do an incremental update).
* The startXXX, endXXX, and word fields can only contain one term per document, so I turned off norms, positions, and tf for these.
* The gramXXX field is unchanged; I didn't want to change spellchecker scoring in any way. But in the future we could reasonably omit norms here too, since I think it's going to be very short.
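To make the per-segment exists() check and the empty-index shortcut concrete, here is a minimal plain-Java sketch. It is not the code from the patch and does not use any Lucene API: the class SegmentedWordIndex and its methods are hypothetical, and each "segment" is simply modeled as a set of words.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a multi-segment spellcheck index (not a Lucene class).
final class SegmentedWordIndex {
    // Each segment is modeled as a set of words; a real index would hold segment readers.
    private final List<Set<String>> segments = new ArrayList<>();

    // Adds a new segment, as an unoptimized index would after an incremental update.
    void addSegment(Set<String> words) {
        segments.add(new HashSet<>(words));
    }

    boolean isEmpty() {
        return segments.isEmpty();
    }

    // Checks each segment in turn and stops at the first hit,
    // so the index does not need to be merged into one segment first.
    boolean exists(String word) {
        for (Set<String> segment : segments) {
            if (segment.contains(word)) {
                return true;
            }
        }
        return false;
    }

    // Returns the candidates that still need to be indexed.
    // When the index is empty (a full rebuild), the exists() check is skipped entirely.
    // Duplicates inside the candidate list itself are not filtered in this sketch.
    List<String> wordsToAdd(List<String> candidates) {
        boolean skipExistsCheck = isEmpty();
        List<String> result = new ArrayList<>();
        for (String word : candidates) {
            if (skipExistsCheck || !exists(word)) {
                result.add(word);
            }
        }
        return result;
    }
}
```

In this model a full rebuild never pays for the lookup, while an incremental update only pays one set lookup per segment for each candidate word.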

Before the patch:
scratch build time: 229,803ms
index size: 214,322,200 bytes
no-op update time (updating when there are no new terms to add): 4,619ms

With the patch:
scratch build time: 99,214ms
index size: 177,781,273 bytes
no-op update time: 2,504ms
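For a quick sense of scale, the two sets of figures can be turned into ratios. The sketch below assumes the first set of measurements is the baseline and the second is with the patch applied; the class and method names are made up for illustration.

```java
import java.util.Locale;

// Hypothetical helper that compares the two sets of measurements quoted in this message.
final class SpellcheckerPatchNumbers {
    // How many times faster the second timing is than the first.
    static double speedup(long beforeMs, long afterMs) {
        return (double) beforeMs / afterMs;
    }

    // Fraction by which a quantity shrank, e.g. 0.25 means 25% smaller.
    static double reduction(long before, long after) {
        return (double) (before - after) / before;
    }

    public static void main(String[] args) {
        // Figures copied from the message; which set is "before" is an assumption.
        System.out.printf(Locale.ROOT, "scratch build speedup: %.2fx%n",
                speedup(229_803L, 99_214L));
        System.out.printf(Locale.ROOT, "index size reduction: %.1f%%%n",
                100.0 * reduction(214_322_200L, 177_781_273L));
        System.out.printf(Locale.ROOT, "no-op update speedup: %.2fx%n",
                speedup(4_619L, 2_504L));
    }
}
```

Under that assumption, the scratch build is a bit more than twice as fast, the index is roughly one sixth smaller, and the no-op update takes a little over half the time.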

I still left the optimize default on, but really I think most users (e.g. Solr) should set the mergefactor to be maybe a bit more reasonable and set optimize to false. The scratch build is then much faster (60,000 ms), but the no-op update time is heavier (e.g. 16,000 ms). Still, if you are rebuilding on every commit for smallish updates, something like 20-30 seconds is a lot better than 100 seconds, but for now I kept the defaults as is (optimizing every time).

> Spellchecker uses default IW mergefactor/ramMB settings of 300/10
> -----------------------------------------------------------------
>                 Key: LUCENE-2391
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/spellchecker
>            Reporter: Mark Miller
>            Priority: Trivial
>         Attachments: LUCENE-2391.patch
> These settings seem odd - I'd like to investigate what makes most sense here.

