lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dawid Weiss (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7639) Use Suffix Arrays for fast search with leading asterisks
Date Wed, 18 Jan 2017 09:36:26 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827721#comment-15827721
] 

Dawid Weiss commented on LUCENE-7639:
-------------------------------------

Nothing is wrong with the FST, really. A suffix array is functionally equivalent to a suffix
tree, which you could build and encode as an FST. Then any infix matching would be done similarly
to suffix array-based lookups. These are all interchangeable concepts. The way we use FSTs
in Lucene is just one particular application to solve one particular problem (essentially
implement an efficient partial prefix trie), hence my remark on the possibility of using two
automata to make an intersection of wildcards possible.

As for your patch, I really don't know how much of a problem (in typical use) prefix and infix
wildcards are. While leading wildcards are heavily used for many things (suggestions, etc.)
I can't quite remember when I needed a prefix or infix wildcard. 

In other words -- don't know whether there'll be much interest in maintaining this code if
it went into Lucene, especially considering the implementation details you mentioned (amount
of memory it requires at runtime, separate construction process, etc.). Let's see what others
think though.

> Use Suffix Arrays for fast search with leading asterisks
> --------------------------------------------------------
>
>                 Key: LUCENE-7639
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7639
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Yakov Sirotkin
>         Attachments: suffix-array.patch
>
>
> If query term starts with asterisks FST checks all words in the dictionary so request
processing speed falls down. This problem can be solved with Suffix Array approach. Luckily,
Suffix Array can be constructed after Lucene start from existing index. Unfortunately, Suffix
Arrays requires a lot of RAM so we can use it only when special flag is set:
> -Dsolr.suffixArray.enable=true
> It is possible to  speed up Suffix Array initialization using several threads, so we
can control number of threads with 
> -Dsolr.suffixArray.initialization_treads_count=5
> This system property can be omitted, the default value is 5.  
> Attached patch is the suggested implementation for SuffixArray support, it works for
all terms starting with asterisks with at least 3 consequent non-wildcard characters. This
patch do not change search results and  affects only performance issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message