lucene-java-user mailing list archives

From Ian Soboroff <ian.sobor...@nist.gov>
Subject Re: Indexing .txt file containing english, german or french alphabet
Date Mon, 26 Sep 2005 19:55:20 GMT
Otis Gospodnetic <otis_gospodnetic@yahoo.com> writes:

> For indexing text that has multiple languages.... I don't know what to
> recommend.  Well, I do - try the StandardAnalyzer and see if that
> produces satisfactory results, but you'd really need a smart analyzer
> that knows how to properly tokenize and filter words from multiple
> languages, and I haven't heard of anyone doing that here.

We have a collection of Reuters documents in 13 languages (mostly
European, but also Russian, Chinese, and Japanese) that we've indexed
successfully with our Lucene-based system.  The text is all in
standard, modern encodings.

Collection link: http://trec.nist.gov/data/reuters/reuters.html

We had no problems whatsoever on the Lucene end.  You do need to take
care with how you read your text before you feed it to an analyzer (in
particular, decode it with the right character encoding), and you need
to treat queries the same way.
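
To make the point about reading text concrete, here is a minimal sketch
that decodes raw bytes with an explicitly named charset before anything
reaches an analyzer.  It uses only the JDK; the class and method names
(EncodingAwareReader, open, readAll, decode) are made up for
illustration and are not part of Lucene, and it assumes you already
know each document's encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a Lucene class): always decode with a named charset
// instead of relying on the platform default.
final class EncodingAwareReader {

    // Wrap a byte stream in a Reader that decodes with the given charset.
    static Reader open(InputStream in, Charset charset) {
        return new InputStreamReader(in, charset);
    }

    // Drain a Reader into a String (useful for short inputs such as queries).
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Convenience: decode an in-memory byte array with the given charset.
    static String decode(byte[] raw, Charset charset) {
        return readAll(open(new ByteArrayInputStream(raw), charset));
    }

    public static void main(String[] args) {
        String original = "Gr\u00fc\u00dfe"; // German "Gruesse" with u-umlaut and sharp s
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the charset the bytes were written in round-trips cleanly.
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1).equals(original));

        // Decoding the same bytes as UTF-8 does not reproduce the original text.
        System.out.println(decode(latin1, StandardCharsets.UTF_8).equals(original));
    }
}
```

The same decoding step would apply to query strings, so that the terms
you search for match the terms that were indexed.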

Obviously the standard Lucene analyzer assumes words are separated by
punctuation and spaces, which is not so good for Asian-language
retrieval performance, and of course it includes no stemmers, if you
want that.  You're best off using language-specific analyzer chains.
If you don't know the language before analysis, that's a harder problem.
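
As a rough illustration of per-language chains, the sketch below picks a
tokenizer by language code.  It is plain JDK code, not Lucene's Analyzer
API: the class PerLanguageTokenizers, its methods, and the language
codes are all hypothetical stand-ins, and overlapping character bigrams
are just one common substitute for word segmentation in Chinese and
Japanese, not necessarily what our system does.  It assumes the language
of each document is known up front.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a set of language-specific analyzer chains.
final class PerLanguageTokenizers {

    // Lowercase, then split on runs of anything that is not a letter or digit.
    static List<String> splitOnPunctuationAndSpace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Emit overlapping two-character tokens from each unbroken run of text.
    static List<String> characterBigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (String run : splitOnPunctuationAndSpace(text)) {
            int[] cps = run.codePoints().toArray();
            if (cps.length == 1) {
                tokens.add(run); // a lone character is kept as its own token
                continue;
            }
            for (int i = 0; i + 1 < cps.length; i++) {
                tokens.add(new String(cps, i, 2));
            }
        }
        return tokens;
    }

    // Fallback for languages with no dedicated entry.
    static final Function<String, List<String>> DEFAULT =
            PerLanguageTokenizers::splitOnPunctuationAndSpace;

    // Language code -> tokenizer; the codes listed here are illustrative only.
    static final Map<String, Function<String, List<String>>> BY_LANGUAGE = new HashMap<>();
    static {
        BY_LANGUAGE.put("en", DEFAULT);
        BY_LANGUAGE.put("de", DEFAULT);
        BY_LANGUAGE.put("fr", DEFAULT);
        BY_LANGUAGE.put("zh", PerLanguageTokenizers::characterBigrams);
        BY_LANGUAGE.put("ja", PerLanguageTokenizers::characterBigrams);
    }

    // Use the same language code at index time and at query time.
    static List<String> tokenize(String languageCode, String text) {
        return BY_LANGUAGE.getOrDefault(languageCode, DEFAULT).apply(text);
    }

    public static void main(String[] args) {
        String tokyo = "\u6771\u4eac\u90fd"; // three CJK ideographs, no spaces
        System.out.println(tokenize("en", "Hello, world!"));
        System.out.println(tokenize("en", tokyo).size()); // whole run stays one token
        System.out.println(tokenize("zh", tokyo).size()); // split into bigrams
    }
}
```

With the punctuation-and-space tokenizer, an unspaced run of ideographs
comes out as a single token, which is the retrieval problem described
above; the bigram variant at least yields terms a shorter query can
match.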

Ian


