lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] [Updated] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes
Date Thu, 07 Jul 2011 17:11:16 GMT


Michael McCandless updated LUCENE-3289:

    Attachment: LUCENE-3289.patch

Patch, I think it's ready to commit!

Separately we should think about how suggest module should set these... I left it at "costly
but perfect minimization".

> FST should allow controlling how hard builder tries to share suffixes
> ---------------------------------------------------------------------
>                 Key: LUCENE-3289
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.4, 4.0
>         Attachments: LUCENE-3289.patch, LUCENE-3289.patch
> Today we have a boolean option to the FST builder telling it whether
> it should share suffixes.
> If you turn this off, building is much faster, uses much less RAM, and
> the resulting FST is a prefix trie.  But, the FST is larger than it
> needs to be.  When it's on, the builder maintains a node hash holding
> every node seen so far in the FST -- this uses up RAM and slows things
> down.
> On a dataset that Elmer (see java-user thread "Autocompletion on large
> index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles
> avg 67.3 chars per title, building with suffix sharing on took 22.5
> seconds, required 1.25 GB heap, and produced 91.6 MB FST.  With suffix
> sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.
> I think we should allow this boolean to be shade-of-gray instead:
> usually, how well suffixes can share is a function of how far they are
> from the end of the string, so, by adding a tunable N to only share
> when suffix length < N, we can let caller make reasonable tradeoffs. 

This message is automatically generated by JIRA.
For more information on JIRA, see:


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message