lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming
Date Tue, 27 Apr 2010 22:03:40 GMT


Robert Muir commented on LUCENE-1190:

Hi Otis, I took a look, followed the blog link, and explored
the linked svn there (it was easier than reading the patch).

I guess the interesting approach I see here is what looks to be
generation of phonetic filters (similar to the ones in Solr)
from aspell resources.

Honestly though, I am not knowledgeable enough about aspell to know
to what degree this would work for some of these languages,
or how it would compare to things like Metaphone.

So, we could potentially use this idea if people wanted some
more phonetic 'hash' functions available for specific languages,
but I have a few concerns:
* I do not know the license of the aspell resources these were generated from
* As mentioned above, I don't know the quality.
* I think it would be preferable for the filter to work from the aspell files rather
than generating code, if possible
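
For readers unfamiliar with the idea of a phonetic 'hash', here is a minimal self-contained sketch in plain Java. It uses classic Soundex purely as an illustration; it is not the aspell-derived rules discussed above, nor Solr's Metaphone filter, and the class name is invented for this example.

```java
// Illustrative only: classic Soundex as a stand-in for the phonetic
// hash functions discussed above. Words that sound alike map to the
// same 4-character code, which is what makes such hashes useful for
// spellchecking candidate lookup.
public class PhoneticHash {
    // Soundex digit for each letter A..Z ('0' = ignored vowel/letter)
    private static final String CODES = "01230120022455012623010202";

    public static String soundex(String word) {
        if (word.isEmpty()) return "";
        String s = word.toUpperCase();
        StringBuilder out = new StringBuilder();
        out.append(s.charAt(0));               // keep the first letter
        char last = code(s.charAt(0));
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = code(s.charAt(i));
            if (c != '0' && c != last) {       // skip vowels and runs
                out.append(c);
            }
            last = c;
        }
        while (out.length() < 4) out.append('0'); // pad to fixed width
        return out.toString();
    }

    private static char code(char c) {
        return (c >= 'A' && c <= 'Z') ? CODES.charAt(c - 'A') : '0';
    }
}
```

With this, "Robert" and "Rupert" both hash to "R163", so a misspelling can be recovered by looking up words sharing the same code.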

As far as what hunspell offers in comparison, I am not sure that
it has this; instead it offers things like typical replacements that
can be attempted for spellchecking and such. Chris Male might
know more, as he has really been the one digging in.

> a lexicon object for merging spellchecker and synonyms from stemming
> --------------------------------------------------------------------
>                 Key: LUCENE-1190
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*, Search
>    Affects Versions: 2.3
>            Reporter: Mathieu Lecarme
>         Attachments: aphone+lexicon.patch, aphone+lexicon.patch
> Some Lucene features need a list of reference words. Spellchecking is the basic example,
> but synonyms are another use. Other tools can be used more smoothly with a list of words,
> without disturbing the main index: stemming and other word simplifications (anagram, phonetic, ...).
> For that, I suggest a Lexicon object, which contains words (Term + frequency) and which can
> be built from a Lucene Directory or from plain text files.
> Classical TokenFilters can be used with Lexicon (LowerCaseFilter and ISOLatin1AccentFilter
> should be the most useful).
> Lexicon uses a Lucene Directory; each word is a Document, and each piece of metadata is a Field (word,
> ngram, phonetic, fields, anagram, size, ...).
> Above a minimum size, the number of different words used in an index can be considered
> stable. So a standard Lexicon (built from Wikipedia, for example) can be used.
> A similarTokenFilter is provided.
> A spellchecker will come soon.
> A fuzzySearch implementation and a neutral synonym TokenFilter can be done.
> Unused words can be removed on demand (lazy delete?)
> Any criticism or suggestions?
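
The Lexicon structure described in the quoted issue can be sketched roughly as follows. This is a hedged illustration in plain Java rather than the patch's actual Lucene-Directory-backed implementation; the class, field, and method names (Word, frequency, anagramKey, add) are invented for this example, not taken from the patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of the Lexicon idea: each word entry carries a frequency plus
// derived keys (here, an anagram key; the patch also mentions ngram and
// phonetic fields) so that spellchecking, synonym, and fuzzy lookups
// can all share one word list separate from the main index.
public class Lexicon {
    public static class Word {
        public final String text;
        public int frequency;
        public final String anagramKey; // letters sorted: "listen" -> "eilnst"

        public Word(String text) {
            this.text = text;
            char[] chars = text.toCharArray();
            Arrays.sort(chars);
            this.anagramKey = new String(chars);
        }
    }

    private final Map<String, Word> words = new HashMap<>();

    // Record one occurrence of a word, creating its entry on first sight.
    public void add(String text) {
        words.computeIfAbsent(text, Word::new).frequency++;
    }

    public int frequency(String text) {
        Word w = words.get(text);
        return w == null ? 0 : w.frequency;
    }
}
```

In the actual proposal each Word would be stored as a Lucene Document with one Field per derived key, so lookups by phonetic code or anagram key become ordinary term queries against the Lexicon's Directory.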

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

