lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hong-Thai Nguyen" <>
Subject RE: Where to find non-English dictionaries, thesaurus, synonyms
Date Thu, 06 Jan 2011 17:38:49 GMT

I'm not sure these non-English spellcheckers, analyzers and related resources are good idea
in real usage. English grammar is quite simple and can be captured in Porter's rules, but
others so different. For example, Porter's rules can not work well in French grammar, neither
in Asian languages. Languages libraries providing in Lucene are practical, but they are satisfy
in real usage? Need an evaluation effort in each language.

Even you have a special language, it's still difficult to decide "satisfy level". You may
need simply Lucene analyser in general search, but in sophisticate content with high quality
required result, you may need a morpho-syntax or semantic analyzer in this context. And these
analyzers are so expensive (in cost, and in processing time).

Develop a language spellchecker, analyzer, stemmer, ... and its dictionaries is still difficult
and out of developer's scope. And Search Provider keeps these advanced features in private.

For each language, you can find out a library (mostly in research context) for spellchecker,
analyzer. You have to integrate in Lucene.
Apache Open-NLP project is a nice effort to collect languages NLP works around the world,
then integrate in common platform:

Princeton Wordnet is concept's definition, not dictionary. These is some other languages:

I'm wondering how Wordnet is useful in search context. You can may uses synsets (synonyms)
like a suggestion dictionary. But stopwords, stem and analyzer dictionaries are dependant
to associate modules.


-----Message d'origine-----
De : Pulkit Singhal [] 
Envoyé : jeudi 6 janvier 2011 17:54
À :
Objet : Where to find non-English dictionaries, thesaurus, synonyms


What's a good source to get dictionaries (for spellcorrections) and/or
thesaurus (for synonyms) that can be used with Lucene for non-English
languages such as Fresh, Chinese, Korean etc?

For example, the wordnet contrib module is based on the data set
provided by the Princeton based wordnet system but I'm wondering where
the Lucene users go for similar reliable source for other languages?


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message