lucene-solr-user mailing list archives

From Daniel Alheiros <Daniel.Alhei...@bbc.co.uk>
Subject Multi-language Tokenizers / Filters recommended?
Date Thu, 21 Jun 2007 14:11:34 GMT
Hi

I'm now looking at how to improve query results for a set of languages and
would like to hear recommendations based on your experience with this.

My default config uses the HTMLStripWhitespaceTokenizerFactory tokenizer with
the WordDelimiterFilterFactory, LowerCaseFilterFactory and
RemoveDuplicatesTokenFilterFactory filters.

I need to deal with:
    English (OK)
    Spanish
    Welsh
    Chinese Simplified
    Russian
    Arabic

For Spanish and Russian I'm using the SnowballPorterFilterFactory plus the
defaults. Should I use any specific TokenizerFactory? Which one?
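
To make the question concrete, here is a rough sketch of the kind of schema.xml
field type I mean for Spanish. The field type name "text_es" and the order of
the filters are only assumptions for illustration, and the tokenizer is written
with what I believe is its actual class name,
solr.HTMLStripWhitespaceTokenizerFactory. For Russian the same sketch would use
language="Russian".

```xml
<!-- Sketch only: "text_es" is a placeholder name, and the filter order is an assumption -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Default tokenizer: strips HTML markup, then splits on whitespace -->
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <!-- Default filters -->
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Language-specific stemming added on top of the defaults -->
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```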

For Chinese I'm going to use a TokenizerFactory that returns the
CJKTokenizer (as I read in a previous discussion about it) plus the default
filters. Is that OK, or are the filters inadequate?
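
As a sketch of what that would look like in schema.xml: the class
com.example.analysis.CJKTokenizerFactory below is hypothetical. It stands for
the factory I would have to write myself to return Lucene's CJKTokenizer, and
"text_zh" is likewise just a placeholder name.

```xml
<!-- Sketch only: the tokenizer class is a hypothetical custom factory, not something shipped with Solr -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Hypothetical factory wrapping org.apache.lucene.analysis.cjk.CJKTokenizer -->
    <tokenizer class="com.example.analysis.CJKTokenizerFactory"/>
    <!-- The same default filters as for the other languages -->
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```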

For Welsh I'm using the defaults and would like to know if you have any
recommendation related to that.

For Arabic, should I use the AraMorph Analyzer (
http://www.nongnu.org/aramorph/english/lucene.html)? What other processing
should I do to get better query results?

Does anyone have stop-words and synonyms for languages other than English?

I think this discussion could become a documentation topic with examples,
how-tos and stop-words / synonyms for each language, so it would be much
simpler for those who need to deal with non-English content. What do you
think about that?

Regards,
Daniel

