lucene-dev mailing list archives

From "Mathieu Lecarme (JIRA)" <>
Subject [jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming
Date Fri, 29 Feb 2008 19:12:51 GMT


Mathieu Lecarme commented on LUCENE-1190:

New features:
A helper to extend a query with words similar to each term:
+type:dog +name:rintint*
will become:
+type:(+dog (dogs doggy)^0.7) +name:rintint*
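
The rewrite above can be sketched in plain Java. This is only an illustration of the string transformation, not the code from the attached patch: the class name SimilarityQueryExpander, the method expandClause, and the rule that wildcard terms are left untouched (inferred from name:rintint* staying unchanged in the example) are all assumptions. The list of similar words is passed in by the caller; in the proposal it would come from the Lexicon.

```java
import java.util.List;

// Hypothetical helper, not part of Lucene or of the attached patch.
final class SimilarityQueryExpander {

    // Rewrites one required clause "+field:term" so that the original term
    // stays mandatory and the similar words are added as an optional,
    // down-weighted group.
    static String expandClause(String field, String term,
                               List<String> similar, double boost) {
        // Assumption: wildcard terms and terms without similar words
        // are passed through unchanged.
        boolean wildcard = term.indexOf('*') >= 0 || term.indexOf('?') >= 0;
        if (wildcard || similar.isEmpty()) {
            return "+" + field + ":" + term;
        }
        return "+" + field + ":(+" + term
                + " (" + String.join(" ", similar) + ")^" + boost + ")";
    }
}
```

Calling expandClause("type", "dog", Arrays.asList("dogs", "doggy"), 0.7) builds the expanded type clause shown above, while the name:rintint* clause is returned as is.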

A "Did you mean" pattern packaged over IndexSearcher. If the number of search results is under a threshold, a sorted
suggestion list for each term is provided, and a rewritten query sentence:
will become:

> a lexicon object for merging spellchecker and synonyms from stemming
> --------------------------------------------------------------------
>                 Key: LUCENE-1190
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*, Search
>    Affects Versions: 2.3
>            Reporter: Mathieu Lecarme
>         Attachments: aphone+lexicon.patch, aphone+lexicon.patch
> Some Lucene features need a reference list of words. Spellchecking is the basic example,
but synonyms are another use. Other tools can work more smoothly with a list of words, without
disturbing the main index: stemming and other simplifications of words (anagram, phonetic ...).
> For that, I suggest a Lexicon object, which contains words (Term + frequency) and which can
be built from a Lucene Directory or from plain text files.
> Classical TokenFilters can be used with the Lexicon (LowerCaseFilter and ISOLatin1AccentFilter
should be the most useful).
> The Lexicon uses a Lucene Directory: each Word is a Document, and each piece of metadata is a Field (word,
ngram, phonetic, fields, anagram, size ...).
> Above a minimum size, the number of different words used in an index can be considered
stable. So, a standard Lexicon (built from Wikipedia, for example) can be used.
> A similarTokenFilter is provided.
> A spellchecker will come soon.
> A fuzzySearch implementation and a neutral synonym TokenFilter could also be built.
> Unused words can be removed on demand (lazy delete?)
> Any criticism or suggestions?
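
To make the "each Word is a Document, each meta is a Field" idea from the description more concrete, here is a minimal plain-Java sketch of the fields that could be derived for one word. It is not taken from the patch. The class LexiconEntrySketch, the choice of character bigrams for "ngram", and the sorted-letters key for "anagram" are assumptions made for illustration, and the phonetic field is left out because the description does not say how it is computed. A real Lexicon would store these values as Fields of a Lucene Document rather than in a Map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of one Lexicon entry, not code from the patch.
final class LexiconEntrySketch {

    // All contiguous character n-grams of the word, in order.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Letters sorted alphabetically: two anagrams share the same key.
    static String anagramKey(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Field name -> value for one word; lower-casing stands in for the
    // LowerCaseFilter step mentioned in the description.
    static Map<String, String> fieldsFor(String word, int frequency) {
        String normalized = word.toLowerCase(Locale.ROOT);
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("word", normalized);
        fields.put("ngram", String.join(" ", ngrams(normalized, 2)));
        fields.put("anagram", anagramKey(normalized));
        fields.put("size", Integer.toString(normalized.length()));
        fields.put("frequency", Integer.toString(frequency));
        return fields;
    }
}
```

With such fields, candidates for a misspelled or inflected word can be looked up by shared n-grams or by an identical anagram key, without touching the main index.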

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.