lucene-dev mailing list archives

From "Robert Muir (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary
Date Sun, 25 Mar 2012 15:47:28 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3888:
--------------------------------

    Attachment: LUCENE-3888.patch

I updated the patch and fixed Koji's test. It's passing, BUT there is a nocommit:
{code}
// nocommit: we need to fix SuggestWord to separate surface and analyzed forms.
// currently the 're-rank' is based on the surface forms!
spellChecker.setAccuracy(0F);
{code}

To explain how the patch currently works, using the Japanese case: the spellchecker has two
phases:
* Phase 1: n-gram approximation phase. Here we generate an n-gram boolean query on the Readings. This is working fine.
* Phase 2: re-rank phase. Here we take the candidates from Phase 1 and do a real comparison (e.g. Levenshtein) to give them the final score. The problem is that this currently uses the surface form!
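
The two phases above can be sketched in plain Java, independent of Lucene. This is only an illustration under stated assumptions, not the SpellChecker implementation: the class TwoPhaseSketch and its methods are hypothetical, and the in-memory n-gram overlap filter is a stand-in for the real Phase 1, which runs an n-gram boolean query against an index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the two phases; not Lucene code.
public class TwoPhaseSketch {

    // Split a word into its character n-grams.
    static Set<String> ngrams(String word, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Phase 1 (approximation): keep entries sharing at least one n-gram with the query.
    static List<String> candidates(String query, List<String> dictionary, int n) {
        Set<String> queryGrams = ngrams(query, n);
        List<String> result = new ArrayList<>();
        for (String entry : dictionary) {
            Set<String> shared = ngrams(entry, n);
            shared.retainAll(queryGrams);
            if (!shared.isEmpty()) {
                result.add(entry);
            }
        }
        return result;
    }

    // Dynamic-programming Levenshtein edit distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Phase 2 (re-rank): order the Phase 1 candidates by edit distance to the query.
    static List<String> rerank(String query, List<String> cands) {
        List<String> sorted = new ArrayList<>(cands);
        sorted.sort(Comparator.comparingInt(c -> levenshtein(query, c)));
        return sorted;
    }
}
```

Phase 1 is deliberately loose and cheap; Phase 2 does the expensive exact comparison only on the survivors. Whatever string is passed to Phase 2 decides the final ranking, which is why the choice of surface versus analyzed form matters there.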

I think phase 2 should re-rank based on the 'analyzed form' too? Inside the spellchecker itself,
I don't think this is very difficult: when analyzed != surface, we just store it for later
retrieval.
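
A rough sketch of what re-ranking on the analyzed form could look like, assuming each candidate carries both forms. Everything here is hypothetical (the class AnalyzedRerankSketch, the Entry pair, the sample katakana readings); it does not use SuggestWord or any other Lucene API, and it is not what the patch does today.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not Lucene code.
public class AnalyzedRerankSketch {

    // Assumed pair: the form shown to the user and the form used for comparison.
    static final class Entry {
        final String surface;
        final String analyzed;

        Entry(String surface, String analyzed) {
            this.surface = surface;
            this.analyzed = analyzed;
        }
    }

    // Dynamic-programming Levenshtein edit distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Compare on the analyzed form, but return the surface forms as suggestions.
    static List<String> rerankByAnalyzed(String analyzedQuery, List<Entry> candidates) {
        List<Entry> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((Entry e) -> levenshtein(analyzedQuery, e.analyzed)));
        List<String> surfaces = new ArrayList<>();
        for (Entry e : sorted) {
            surfaces.add(e.surface);
        }
        return surfaces;
    }
}
```

For example, with candidates (大阪, オオサカ), (京都, キョウト), (東京, トウキョウ) and the mistyped reading トウキヨウ, the reading of 東京 is one edit away and ranks first, while the two-character surface forms alone give the comparison almost nothing to work with.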

The problem is the spellcheck comparison APIs such as SuggestWord don't even have any getters
or setters and present no way for me to migrate to surface+analyzed in any backwards compatible
way...

I'll think about this in the meantime. Maybe we should just break and clean up these APIs, since
it's a contrib module and they are funky?

                
> split off the spell check word and surface form in spell check dictionary
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-3888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/spellchecker
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch
>
>
> The "did you mean?" feature built on Lucene's spell checker unfortunately does not work well for Japanese, and this is a longstanding problem: the logic needs comparatively long text to check spelling, but in some languages (e.g. Japanese) most words are too short for the spell checker to use.
> I think that, at least for Japanese, things can be improved if we split off the spell check word and the surface form in the spell check dictionary. Then we could, for example, use ReadingAttribute for spell checking but CharTermAttribute for suggesting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

