lucene-dev mailing list archives

From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-9429) Spellcheck Token Filter
Date Wed, 24 Aug 2016 13:13:20 GMT

    [ https://issues.apache.org/jira/browse/SOLR-9429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434901#comment-15434901 ]

David Smiley commented on SOLR-9429:
------------------------------------

Ha! Neat idea.  I have no idea whether it's actually a good idea in practice, but it's clever
nonetheless.  I suspect one would want only very-high-confidence typos resolved in this way,
and that you might want the original uncorrected word stored somehow, perhaps in a payload.
Alternatively, index both, so that this proposed token filter introduces synonyms and leaves
the original dubious word in place.

Something like this could be contributed to the Lucene spellcheck module.
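
As a rough illustration of the "introduces synonyms" option above, the sketch below keeps every original token and emits a correction next to it at the same position. It is plain Java, not a Lucene TokenFilter. The names KeepOriginalSketch, Tok and expand are hypothetical, and the corrections map stands in for whatever high-confidence lookup a real filter would use. The one convention borrowed from Lucene is that a position increment of 0 stacks a token on the previous token's position, which is how synonyms are normally represented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: emit the original token, then its correction at the same position.
final class KeepOriginalSketch {

    // A term plus its position increment (1 = next position, 0 = same position as previous token).
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    // 'corrections' is an assumed lookup holding only high-confidence fixes.
    static List<Tok> expand(List<String> tokens, Map<String, String> corrections) {
        List<Tok> out = new ArrayList<>();
        for (String token : tokens) {
            // The original, possibly dubious, word always stays in place.
            out.add(new Tok(token, 1));
            String fixed = corrections.get(token);
            if (fixed != null && !fixed.equals(token)) {
                // The correction is stacked on the same position, like a synonym.
                out.add(new Tok(fixed, 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> corrections = new HashMap<>();
        corrections.put("gamign", "gaming");
        List<String> tokens = Arrays.asList("gamign", "is", "a", "strong", "industry");
        System.out.println(expand(tokens, corrections));
    }
}
```

With this shape a query for either "gamign" or "gaming" would match the same position, at the cost of a slightly larger index.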

> Spellcheck Token Filter
> -----------------------
>
>                 Key: SOLR-9429
>                 URL: https://issues.apache.org/jira/browse/SOLR-9429
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Schema and Analysis
>            Reporter: Alessandro Benedetti
>            Priority: Minor
>
> This issue is about the design and implementation of a new token filter called SpellcheckTokenFilter.
> This new token filter takes a token stream as input and returns collated tokens, based on a dictionary.
> The aim of the token filter is to fix misspelled words and index the correct token.
> e.g.
> Given dictionary d1 :
> gaming
> gamer
> Given text t1 for the field f1 :
> gamign is a strong industry
> The token filter will return in output :
> gaming is a strong industry
> A first possible design is to mimic the approach used in the spellchecker:
> build an FST for the dictionary, then build the Levenshtein FST for each token and intersect the two.
> Possible applications include OCR-generated text and other use cases where misspelled words are common and we want to clean them up at indexing time.
> This could also be used in a more complex analyser, with a stemmer added afterward.
> This is a draft idea that came from a blog comment by Shyamsunder.
> Feedback and additional ideas are welcome!
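
To make the quoted d1/t1 example concrete, here is a minimal plain-Java sketch of the behaviour described in the issue. It is not the FST design: a brute-force scan of the dictionary with an edit-distance function stands in for intersecting a dictionary FST with a per-token Levenshtein automaton. Everything named here (SpellcheckSketch, distance, correct, correctText, the maxEdits parameter) is hypothetical. Three assumptions are built in: an adjacent transposition counts as one edit, tokens are split on whitespace, and a token is left unchanged when no dictionary word is within maxEdits or when the closest match is tied, as a crude stand-in for "high confidence only".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not a Lucene TokenFilter and not the FST-based design from the issue.
final class SpellcheckSketch {

    // Optimal string alignment distance: Levenshtein edits plus adjacent transpositions.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1); // deletion, insertion
                best = Math.min(best, d[i - 1][j - 1] + cost);          // match or substitution
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);         // adjacent transposition
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    // Returns the unique closest dictionary word within maxEdits, else the token unchanged.
    static String correct(String token, List<String> dictionary, int maxEdits) {
        if (dictionary.contains(token)) return token;
        String best = null;
        int bestDistance = maxEdits + 1;
        boolean tie = false;
        for (String word : dictionary) {
            int dist = distance(token, word);
            if (dist < bestDistance) {
                best = word;
                bestDistance = dist;
                tie = false;
            } else if (best != null && dist == bestDistance) {
                tie = true; // two equally close candidates: treat as low confidence
            }
        }
        return (best == null || tie) ? token : best;
    }

    // Whitespace tokenization is an assumption made to keep the sketch short.
    static String correctText(String text, List<String> dictionary, int maxEdits) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            out.add(correct(token, dictionary, maxEdits));
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("gaming", "gamer");
        // prints "gaming is a strong industry"
        System.out.println(correctText("gamign is a strong industry", d1, 1));
    }
}
```

The scan is linear in the dictionary size for every token, which is exactly the cost the FST intersection in the proposal is meant to avoid.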



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

