lucene-dev mailing list archives

From "Robert Muir (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-3842) Analyzing Suggester
Date Sat, 03 Mar 2012 20:49:57 GMT
Analyzing Suggester
-------------------

                 Key: LUCENE-3842
                 URL: https://issues.apache.org/jira/browse/LUCENE-3842
             Project: Lucene - Java
          Issue Type: New Feature
          Components: modules/spellchecker
    Affects Versions: 3.6, 4.0
            Reporter: Robert Muir


Since we added shortest-path wFSA search in LUCENE-3714 and generified the comparator in
LUCENE-3801, I think we should look at implementing suggesters that have more capabilities
than just basic prefix matching.

In particular, I think the most flexible approach is to integrate with Analyzer at both build
and query time, such that we build a wFST with:
input: analyzed text such as ghost0christmas0past <-- byte 0 here is an optional token separator
output: surface form such as "the ghost of christmas past"
weight: the weight of the suggestion
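
As a rough illustration of the input/output/weight triple above, here is a minimal sketch
in plain Java. It does not use the Lucene Analyzer API: the analyze() method, the stopword
list, and the Entry class are hypothetical stand-ins that lowercase the text, drop
stopwords, and join the remaining tokens with byte 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy stand-in for an Analyzer: lowercases, splits on whitespace, drops stopwords. */
final class AnalyzedKeySketch {
    // Hypothetical stopword list, for illustration only.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "of", "a", "an"));

    // Token separator: byte 0, as described in the proposal.
    static final char SEP = '\u0000';

    /** Returns the analyzed form: the kept tokens joined by SEP. */
    static String analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(String.valueOf(SEP), kept);
    }

    /** One suggestion entry: analyzed input, surface-form output, weight. */
    static final class Entry {
        final String input;
        final String output;
        final long weight;

        Entry(String surface, long weight) {
            this.input = analyze(surface);
            this.output = surface;
            this.weight = weight;
        }
    }

    public static void main(String[] args) {
        Entry e = new Entry("the ghost of christmas past", 10);
        // Show the separator as '0' so it is visible when printed.
        System.out.println(e.input.replace(SEP, '0') + " -> " + e.output);
        // prints: ghost0christmas0past -> the ghost of christmas past
    }
}
```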

We make an FST with PairOutputs<weight,output>, but only do the shortest-path operation
on the weight side (like the test in LUCENE-3801), while accumulating the output (surface
form), which will be the actual suggestion.
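
The idea of searching on the weight side only, while carrying the surface form along, can
be sketched with a toy weighted trie. This is not the Lucene FST or PairOutputs API: the
class and method names are hypothetical, the surface form is stored at the end node rather
than accumulated along arcs, and the sketch assumes that a lower cost means a better
suggestion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Toy weighted trie standing in for a wFST whose outputs pair a cost with a surface form. */
final class WeightedTrieSketch {
    static final class Node {
        final Map<Character, Node> next = new TreeMap<>();
        long minCost = Long.MAX_VALUE; // cheapest entry at or below this node
        String surface;                // non-null only where an entry ends
        long cost;                     // cost of the entry ending here
    }

    /** Search item: either a subtree still to expand or a finished entry. */
    static final class Item {
        final long cost;
        final Node node;
        final boolean isEntry;

        Item(long cost, Node node, boolean isEntry) {
            this.cost = cost;
            this.node = node;
            this.isEntry = isEntry;
        }
    }

    private final Node root = new Node();

    /** Adds an entry; if two entries share an analyzed form, the cheaper one is kept. */
    void add(String analyzed, String surface, long cost) {
        Node n = root;
        n.minCost = Math.min(n.minCost, cost);
        for (char c : analyzed.toCharArray()) {
            n = n.next.computeIfAbsent(c, key -> new Node());
            n.minCost = Math.min(n.minCost, cost);
        }
        if (n.surface == null || cost < n.cost) {
            n.surface = surface;
            n.cost = cost;
        }
    }

    /** Returns up to k surface forms under the prefix, cheapest first. */
    List<String> lookup(String analyzedPrefix, int k) {
        List<String> results = new ArrayList<>();
        Node start = root;
        for (char c : analyzedPrefix.toCharArray()) {
            start = start.next.get(c);
            if (start == null) {
                return results; // no entry has this prefix
            }
        }
        // Best-first search ordered by cost only; the surface form rides along.
        PriorityQueue<Item> queue = new PriorityQueue<>((a, b) -> Long.compare(a.cost, b.cost));
        queue.add(new Item(start.minCost, start, false));
        while (!queue.isEmpty() && results.size() < k) {
            Item item = queue.poll();
            if (item.isEntry) {
                results.add(item.node.surface);
                continue;
            }
            if (item.node.surface != null) {
                queue.add(new Item(item.node.cost, item.node, true));
            }
            for (Node child : item.node.next.values()) {
                queue.add(new Item(child.minCost, child, false));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        WeightedTrieSketch trie = new WeightedTrieSketch();
        trie.add("ghost\u0000christmas\u0000past", "the ghost of christmas past", 1);
        trie.add("ghost\u0000christmas\u0000present", "the ghost of christmas present", 3);
        trie.add("ghostbusters", "Ghostbusters", 2);
        System.out.println(trie.lookup("ghost", 2));
        // prints: [the ghost of christmas past, Ghostbusters]
    }
}
```

Because each node records the cheapest cost below it, the queue never needs to look at a
surface form to decide what to expand next.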

This allows a lot of flexibility:
* Using even StandardAnalyzer means you can offer suggestions that ignore stopwords, e.g.
  if you type in "ghost of chr...", it will suggest "the ghost of christmas past"
* We can add support for synonyms/wdf/etc. at both index and query time (there are tradeoffs
  here, and this is not implemented!)
* This is a basis for more complicated suggesters such as Japanese suggesters, where the
  analyzed form is in fact the reading, so we would add a TokenFilter that copies
  ReadingAttribute into term text to support that...
* Other general things, like offering suggestions that are more "fuzzy", e.g. by using a
  plural stemmer or ignoring accents.
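
To make the stopword and accent cases above concrete, here is a small sketch in plain Java.
It runs the same toy analysis on both the stored suggestions and the typed text, then does
a prefix match on the analyzed form. Everything here is an assumption for illustration: the
analyze() method, the stopword list, and the TreeMap used in place of an FST are not Lucene
APIs, and weights are left out.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeMap;

/** Toy suggester that analyzes text the same way at build time and at query time. */
final class AnalyzingLookupSketch {
    // Hypothetical stopword list, for illustration only.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "of", "a", "an"));
    static final String SEP = "\u0000"; // byte 0 token separator

    /** Strips accents, lowercases, drops stopwords, and joins tokens with SEP. */
    static String analyze(String text) {
        String folded = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "") // remove combining marks left by decomposition
                .toLowerCase(Locale.ROOT);
        List<String> kept = new ArrayList<>();
        for (String token : folded.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(SEP, kept);
    }

    // Sorted map from analyzed form to surface form; a stand-in for the FST.
    private final TreeMap<String, String> index = new TreeMap<>();

    void add(String surface) {
        index.put(analyze(surface), surface);
    }

    /** Returns surface forms whose analyzed form starts with the analyzed query. */
    List<String> suggest(String typed) {
        String prefix = analyze(typed);
        // All keys in [prefix, prefix + U+FFFF) share the prefix.
        return new ArrayList<>(
                index.subMap(prefix, true, prefix + Character.MAX_VALUE, false).values());
    }

    public static void main(String[] args) {
        AnalyzingLookupSketch s = new AnalyzingLookupSketch();
        s.add("the ghost of christmas past");
        s.add("Caf\u00e9 de Flore");
        System.out.println(s.suggest("ghost of chr")); // prints: [the ghost of christmas past]
        System.out.println(s.suggest("cafe d").size()); // prints: 1
    }
}
```

One limitation of this sketch: a partially typed stopword (e.g. "th") is treated as an
ordinary token, so it would not match.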

According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~100,000
QPS), and the FST size does not explode (it's just short of twice that of a regular wFST,
but this is still far smaller than TST or JaSpell, etc.).



        


