lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary
Date Tue, 20 Mar 2012 08:07:52 GMT


Robert Muir commented on LUCENE-3888:

Koji: hmm I think the problem is not in the Dictionary interface (which is actually ok),
but instead in the spellcheckers and suggesters themselves?

For spellchecking, I think we need to expose more analysis options in Spellchecker:
currently this is hardcoded to KeywordAnalyzer (it indexes NOT_ANALYZED).
Instead I think you should be able to pass an Analyzer; we would also
have a TokenFilter for Japanese that replaces the term text with the reading from ReadingAttribute.
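A minimal, self-contained sketch of what such a filter does. The `Token`, `ReadingFormFilter`, and field names below are simplified stand-ins for illustration only, not Lucene's actual TokenFilter/CharTermAttribute/ReadingAttribute API:

```java
// Simplified stand-ins for Lucene's attribute-based TokenStream API.
// In real Lucene these would be CharTermAttribute and Kuromoji's
// ReadingAttribute; the class and field names here are illustrative only.
class Token {
    String term;     // surface form, e.g. "寿司"
    String reading;  // reading, e.g. "スシ"
    Token(String term, String reading) { this.term = term; this.reading = reading; }
}

// The proposed filter's core behavior: when a token carries a reading,
// overwrite the term text with it, so downstream spellcheck/suggest
// components see the reading instead of the surface form.
class ReadingFormFilter {
    Token filter(Token in) {
        if (in.reading != null) {
            in.term = in.reading; // copy reading over the surface form
        }
        return in;
    }
}

public class ReadingFormFilterDemo {
    public static void main(String[] args) {
        ReadingFormFilter f = new ReadingFormFilter();
        System.out.println(f.filter(new Token("寿司", "スシ")).term); // スシ
    }
}
```

Tokens without a reading pass through unchanged, so the same chain stays usable for non-Japanese text.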

In the same way, suggest can analyze too (LUCENE-3842 is already some work toward that, especially
with the idea of supporting Japanese in this exact same way).

So in short I think we should:
# create a TokenFilter (similar to BaseFormFilter) which copies ReadingAttribute into termAtt.
# refactor the 'n-gram analysis' in the spellchecker to work on actual TokenStreams (this can
  also likely be implemented as TokenStreams), allowing the user to set an Analyzer on Spellchecker
  to control how it analyzes text.
# continue to work on 'analysis for suggest' like LUCENE-3842.
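As a rough illustration of the n-gram analysis mentioned in step 2 (a plain-Java sketch; the real spellchecker's gram fields and min/max gram lengths are not modeled here):

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {
    // Character n-grams of the kind the spellchecker indexes for a word,
    // e.g. "lucene" with n=3 -> luc, uce, cen, ene.
    static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        // With an Analyzer in front, the same gramming could run over
        // analyzed terms (e.g. readings) instead of raw surface forms.
        System.out.println(ngrams("lucene", 3)); // [luc, uce, cen, ene]
    }
}
```

The point of the refactor is that what gets grammed would no longer be the raw input string but the output of a user-supplied Analyzer.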

Note this use of analyzers in spellcheck/suggest is unrelated to Solr's current use of 'analyzers',
which is only for some query manipulation and not very useful.

> split off the spell check word and surface form in spell check dictionary
> -------------------------------------------------------------------------
>                 Key: LUCENE-3888
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/spellchecker
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.6, 4.0
>         Attachments: LUCENE-3888.patch
> The "did you mean?" feature built on Lucene's spell checker unfortunately cannot work well
> for Japanese, and this is a longstanding problem: the logic needs comparatively long text
> to check spelling, but in some languages (e.g. Japanese) most words are too short for the
> spell checker to use.
> I think, at least for Japanese, things can be improved if we split off the spell check
> word and the surface form in the spell check dictionary. Then we could use ReadingAttribute
> for spell checking but CharTermAttribute for suggesting, for example.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.