lucene-dev mailing list archives

From "Mathieu Lecarme (JIRA)" <>
Subject [jira] Created: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming
Date Mon, 25 Feb 2008 20:46:51 GMT
a lexicon object for merging spellchecker and synonyms from stemming

                 Key: LUCENE-1190
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/*, Search
    Affects Versions: 2.3
            Reporter: Mathieu Lecarme
         Attachments: aphone+lexicon.patch

Some Lucene features need a reference list of words. Spellchecking is the basic example, but
synonyms are another use. Other tools work more smoothly with a word list that does not
disturb the main index: stemming and other word simplifications (anagram, phonetic, ...).
For that, I suggest a Lexicon object, which contains words (Term + frequency) and which can be
built from a Lucene Directory or from plain text files.
Classical TokenFilters can be used with the Lexicon (LowerCaseFilter and ISOLatin1AccentFilter
should be the most useful).
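
To illustrate building the word list from plain text, here is a minimal sketch in plain Java.
It does not use the Lucene API or the attached patch: the class LexiconBuilderSketch and its
methods are hypothetical names, and normalize() only roughly approximates what LowerCaseFilter
and ISOLatin1AccentFilter would do (it lowercases and strips combining accents, nothing more).

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Lucene or of the attached patch.
public class LexiconBuilderSketch {

    // Lowercase and strip combining accents: a plain-Java stand-in for
    // LowerCaseFilter + ISOLatin1AccentFilter, not the Lucene filters themselves.
    public static String normalize(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    // Count how often each normalized word occurs in the text (word + frequency).
    public static Map<String, Integer> build(String text) {
        Map<String, Integer> frequencies = new TreeMap<>();
        // Split on anything that is neither a letter nor a combining mark.
        for (String token : text.split("[^\\p{L}\\p{M}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            frequencies.merge(normalize(token), 1, Integer::sum);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // "Caf\u00e9", "cafe" and "CAF\u00c9" collapse into a single entry.
        System.out.println(build("Caf\u00e9 cafe CAF\u00c9 index Index"));
    }
}
```

Normalizing before counting means spelling variants share one entry, so the stored frequency
reflects the word rather than one particular surface form.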
The Lexicon uses a Lucene Directory: each Word is a Document, and each piece of metadata is a
Field (word, ngram, phonetic, fields, anagram, size, ...).
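
As a rough sketch of the per-word metadata listed above, the following plain Java derives a few
of those values for one word. The class WordMetaSketch, the choice of 3-grams, and the
space-separated ngram encoding are assumptions made for illustration; the phonetic field is
left out because the message does not say which encoding is used. In the proposal each map
entry would be a Field on the word's Document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Lucene or of the attached patch.
public class WordMetaSketch {

    // All contiguous character n-grams of the word, in order.
    public static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Characters sorted in ascending order: anagrams share the same key.
    public static String anagramKey(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // One entry per metadata value, keyed by the field names from the message.
    public static Map<String, String> describe(String word) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("word", word);
        fields.put("ngram", String.join(" ", ngrams(word, 3)));
        fields.put("anagram", anagramKey(word));
        fields.put("size", Integer.toString(word.length()));
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(describe("lucene"));
    }
}
```

Storing the sorted-letter key means an anagram lookup becomes an exact match on one field
instead of a scan over every word.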
Above a minimum size, the number of different words used in an index can be considered stable,
so a standard Lexicon (built from Wikipedia, for example) can be used.
A similarTokenFilter is provided.
A spellchecker will come soon.
A fuzzySearch implementation and a neutral synonym TokenFilter could also be built.
Unused words can be removed on demand (lazy delete?).

Any criticism or suggestions?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
