lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes
Date Thu, 07 Jul 2011 10:52:16 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061195#comment-13061195
] 

Michael McCandless commented on LUCENE-3289:
--------------------------------------------

NOTE: patch applies to 3.x.

I ran the patch on the titles, varying the max prefix sharing length:

||Len||FST Size||Seconds||
|1|135446807|8.2|
|2|137632702|8.5|
|3|135177994|8.3|
|4|132782016|8.3|
|5|130415331|8.4|
|6|128086200|8.0|
|7|125797396|8.2|
|8|123552157|8.5|
|9|121358375|8.4|
|10|119228942|8.1|
|11|117181180|8.8|
|12|115229788|8.7|
|13|113388260|9.5|
|14|111664442|9.0|
|15|110059167|9.2|
|16|108572519|9.7|
|17|107201905|9.8|
|18|105942576|10.3|
|19|104791497|10.1|
|20|103745678|11.1|
|21|102801693|10.8|
|22|101957797|11.4|
|23|101206564|11.1|
|24|100541849|11.0|
|25|99956443|11.1|
|26|99443232|12.9|
|27|98995194|13.2|
|28|98604680|13.9|
|29|98264184|13.5|
|30|97969241|13.6|
|31|97714049|13.8|
|32|97494104|14.3|
|33|97304045|14.0|
|34|97140033|14.3|
|35|96998942|14.6|
|36|96877590|16.5|
|37|96773039|16.9|
|38|96682961|16.6|
|39|96605160|17.8|
|40|96537687|18.3|
|41|96479286|17.8|
|42|96428710|17.5|
|43|96384659|18.9|
|44|96346174|17.0|
|45|96312826|19.3|
|46|96283545|17.8|
|47|96257708|19.4|
|48|96235159|19.0|
|49|96215220|18.7|
|50|96197450|19.6|
|51|96181539|17.3|
|52|96167235|16.9|
|53|96154490|17.7|
|54|96143081|18.8|
|55|96132905|17.4|
|56|96123776|17.5|
|57|96115462|20.7|
|58|96108051|19.2|
|59|96101249|19.1|
|60|96095107|18.7|
|ALL|96020343|22.5|

Very very odd that FST size first goes up at N=2... not yet sure why.  But from this curve
it looks like there is a sweet spot around maybe N=24.  I didn't measure required heap here,
but it also will go down as N goes down.


> FST should allow controlling how hard builder tries to share suffixes
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3289
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3289
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3289.patch
>
>
> Today we have a boolean option to the FST builder telling it whether
> it should share suffixes.
> If you turn this off, building is much faster, uses much less RAM, and
> the resulting FST is a prefix trie.  But, the FST is larger than it
> needs to be.  When it's on, the builder maintains a node hash holding
> every node seen so far in the FST -- this uses up RAM and slows things
> down.
> On a dataset that Elmer (see java-user thread "Autocompletion on large
> index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles
> avg 67.3 chars per title, building with suffix sharing on took 22.5
> seconds, required 1.25 GB heap, and produced 91.6 MB FST.  With suffix
> sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.
> I think we should allow this boolean to be shade-of-gray instead:
> usually, how well suffixes can share is a function of how far they are
> from the end of the string, so, by adding a tunable N to only share
> when suffix length < N, we can let caller make reasonable tradeoffs. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message