lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes
Date Thu, 07 Jul 2011 17:33:16 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061453#comment-13061453 ]

Robert Muir commented on LUCENE-3289:
-------------------------------------

I think that's probably good for most cases?

In the example you gave, it seems that an FST might not be the best data structure? The strings are
extremely long (more like short documents) and probably need to be "compressed" in some different
data structure, e.g. a word-based one?

> FST should allow controlling how hard builder tries to share suffixes
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3289
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3289
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3289.patch, LUCENE-3289.patch
>
>
> Today we have a boolean option to the FST builder telling it whether
> it should share suffixes.
> If you turn this off, building is much faster, uses much less RAM, and
> the resulting FST is a prefix trie.  But, the FST is larger than it
> needs to be.  When it's on, the builder maintains a node hash holding
> every node seen so far in the FST -- this uses up RAM and slows things
> down.
> On a dataset that Elmer (see java-user thread "Autocompletion on large
> index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles
> averaging 67.3 chars per title, building with suffix sharing on took 22.5
> seconds, required 1.25 GB heap, and produced a 91.6 MB FST.  With suffix
> sharing off, it took 8.2 seconds, 450 MB heap and produced a 129 MB FST.
> I think we should allow this boolean to be shade-of-gray instead:
> usually, how well suffixes can share is a function of how far they are
> from the end of the string, so, by adding a tunable N to only share
> when suffix length < N, we can let caller make reasonable tradeoffs. 
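
To make the proposed tradeoff concrete, here is a minimal sketch in Python. It is
not Lucene's FST builder and does not use any Lucene API: it builds a plain prefix
trie (an acceptor with no outputs), then merges identical subtrees through a
"node hash", but only for nodes whose longest remaining suffix is shorter than N.
All names (Node, build_trie, share_suffixes, max_suffix_len) are hypothetical and
chosen for illustration only.

```python
from typing import Dict, Iterable, Tuple


class Node:
    """One trie/automaton state: outgoing edges plus an end-of-word flag."""
    __slots__ = ("edges", "final")

    def __init__(self) -> None:
        self.edges: Dict[str, "Node"] = {}
        self.final = False


def build_trie(words: Iterable[str]) -> Node:
    """Build a plain prefix trie (what you get with suffix sharing off)."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def share_suffixes(root: Node, max_suffix_len: int) -> Node:
    """Merge identical subtrees, but only where the longest suffix below a
    node is shorter than max_suffix_len (the tunable N from the issue).
    Uses recursion, so it is only suitable for short inputs."""
    registry: Dict[tuple, Node] = {}  # the "node hash": signature -> node

    def freeze(node: Node) -> Tuple[Node, int]:
        height = 0  # length of the longest suffix reachable from this node
        for ch in list(node.edges):
            child, child_height = freeze(node.edges[ch])
            node.edges[ch] = child
            height = max(height, child_height + 1)
        if height < max_suffix_len:
            # Children are already canonical, so their identity is enough.
            sig = (node.final,
                   tuple(sorted((ch, id(c)) for ch, c in node.edges.items())))
            node = registry.setdefault(sig, node)
        return node, height

    return freeze(root)[0]


def count_nodes(root: Node) -> int:
    """Count distinct nodes reachable from root."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.edges.values())
    return len(seen)


def accepts(root: Node, word: str) -> bool:
    """Check that a word is still accepted after merging."""
    node = root
    for ch in word:
        if ch not in node.edges:
            return False
        node = node.edges[ch]
    return node.final


if __name__ == "__main__":
    words = ["walking", "talking", "walked", "talked"]
    for n in (0, 1, 3, 100):
        fst = share_suffixes(build_trie(words), n)
        print(f"N={n}: {count_nodes(fst)} nodes")
```

With N = 0 nothing is ever looked up in the node hash and the result is the plain
trie; with a large N every node is a candidate and the result is fully minimized.
Values in between keep the hash small while still sharing the nodes near the end
of the strings, which is where the issue expects most of the sharing to happen.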
