lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3842) Analyzing Suggester
Date Fri, 28 Sep 2012 17:47:08 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465778#comment-13465778 ]

Michael McCandless commented on LUCENE-3842:
--------------------------------------------

Thanks Rob, good feedback ... I'll post a new patch that changes the posInc check to an assert
and removes the obsolete NOTE.

{quote}
As far as the limitations, i feel like if the last token's endOffset != length of input
that might be pretty safe in general (e.g. standardtokenizer) because of how unicode
works... i have to think about it.
{quote}

I think we should try that!  This way the suggester can "guess" whether the input text is
still inside the last token.
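
A minimal sketch of that end-offset check, reading the heuristic as "the last token is still being typed when its end offset equals the input length". It uses a hypothetical regex tokenizer as a stand-in for a real Analyzer; `tokenize` and `last_token_is_partial` are illustrative names, not Lucene APIs.

```python
import re


def tokenize(text):
    """Return (term, start_offset, end_offset) for each run of word characters."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def last_token_is_partial(text):
    """Guess whether the user is still typing the last token.

    If the last token's end offset equals the input length, nothing
    (space, punctuation) follows the token yet, so it may be incomplete.
    """
    tokens = tokenize(text)
    if not tokens:
        return False
    _, _, end_offset = tokens[-1]
    return end_offset == len(text)


for text in ["ghost of chr", "ghost of ", "ghost of christmas."]:
    print(repr(text), last_token_is_partial(text))
```

Only the first input is treated as ending inside a token; the trailing space and the trailing period both push the input length past the last end offset.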

But this won't help the StopFilter case, i.e. if the user types 'a' then StopFilter will still delete
it even though the token isn't "done" (i.e. maybe the user intends to type 'apple').
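
To illustrate the StopFilter limitation, here is a rough sketch using a hypothetical analysis chain (regex tokenizer, lowercasing, and a made-up stop-word set that is not Lucene's default list). Once the stop filter has removed 'a', there is no token left whose end offset could be compared with the input length.

```python
import re

STOP_WORDS = {"a", "an", "the", "of"}  # illustrative only, not Lucene's defaults


def analyze(text):
    """Tokenize on word characters, lowercase, then drop stop words.

    Returns (term, end_offset) pairs for the surviving tokens.
    """
    tokens = [(m.group().lower(), m.end()) for m in re.finditer(r"\w+", text)]
    return [(term, end) for term, end in tokens if term not in STOP_WORDS]


def last_surviving_token_touches_end(text):
    """End-offset check applied to whatever survives the stop filter."""
    tokens = analyze(text)
    return bool(tokens) and tokens[-1][1] == len(text)


print(analyze("a"))      # the partial token is gone before any check can run
print(analyze("apple"))  # the completed word survives
print(last_surviving_token_touches_end("a"))
```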

Still, it's progress, so I think we should try it ...

I'm not sure why the FST is so much larger ... the outputs should share very well with KeywordTokenizer
... hmm, what weights do we use for the benchmark?
                
> Analyzing Suggester
> -------------------
>
>                 Key: LUCENE-3842
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3842
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/spellchecker
>    Affects Versions: 3.6, 4.0-ALPHA
>            Reporter: Robert Muir
>         Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch,
> LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch,
> LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch,
> LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842-TokenStream_to_Automaton.patch
>
>
> Since we added shortest-path wFSA search in LUCENE-3714, and generified the comparator in LUCENE-3801,
> I think we should look at implementing suggesters that have more capabilities than just basic prefix matching.
> In particular I think the most flexible approach is to integrate with Analyzer at both build and query time,
> such that we build a wFST with:
> input: analyzed text such as ghost0christmas0past <-- byte 0 here is an optional token separator
> output: surface form such as "the ghost of christmas past"
> weight: the weight of the suggestion
> we make an FST with PairOutputs<weight,output>, but only do the shortest path operation on the weight side (like
> the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion.
> This allows a lot of flexibility:
> * Using even standardanalyzer means you can offer suggestions that ignore stopwords, e.g. if you type in "ghost of chr...",
>   it will suggest "the ghost of christmas past"
> * we can add support for synonyms/wdf/etc at both index and query time (there are tradeoffs here, and this is not implemented!)
> * this is a basis for more complicated suggesters such as Japanese suggesters, where the analyzed form is in fact the reading,
>   so we would add a TokenFilter that copies ReadingAttribute into term text to support that...
> * other general things like offering suggestions that are more "fuzzy" like using a plural stemmer or ignoring accents or whatever.
> According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~ 100,000 QPS), and the FST size does not
> explode (its short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc).
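
The input/output/weight layout in the description quoted above can be sketched roughly as follows. This is a toy stand-in, not the patch: a sorted list with binary search replaces the wFST, `ToySuggester` and `analyze` are hypothetical names, the stop-word set is made up, and a higher weight is assumed to mean a better suggestion.

```python
import bisect
import re

SEP = "\x00"  # token separator, like the "byte 0" in the description
STOP_WORDS = {"a", "an", "the", "of"}  # illustrative only


def analyze(text):
    """Lowercase, drop stop words, and join the remaining terms with SEP."""
    terms = [t.lower() for t in re.findall(r"\w+", text)]
    return SEP.join(t for t in terms if t not in STOP_WORDS)


class ToySuggester:
    """Maps analyzed form -> (weight, surface form); looked up by analyzed prefix."""

    def __init__(self, entries):
        # entries: iterable of (surface_form, weight)
        self._rows = sorted((analyze(s), w, s) for s, w in entries)
        self._keys = [row[0] for row in self._rows]

    def lookup(self, text, num=5):
        prefix = analyze(text)
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key, weight, surface in self._rows[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so prefix matches are contiguous
            matches.append((weight, surface))
        matches.sort(key=lambda m: -m[0])  # assumed: higher weight ranks first
        return [surface for _, surface in matches[:num]]


suggester = ToySuggester([
    ("the ghost of christmas past", 50),
    ("ghostbusters", 80),
    ("a christmas carol", 30),
])
print(suggester.lookup("ghost of chr"))
```

Because both the stored entries and the query go through the same `analyze` step, the stop words in "ghost of chr" and in the stored surface form drop out on both sides, while the returned suggestion is still the original surface text.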

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

