lucene-java-user mailing list archives

From "Eric Isakson" <>
Subject RE: Multi Language support
Date Thu, 06 Mar 2003 14:57:56 GMT
Hi Günter,

I had a similar requirement for my use of Lucene. We have documents with mixed languages,
some of the text in the user's native language and some in English. We decided not to use
any of the stemming analyzers and to index with no stop words (I didn't like the no-stop-words
decision, but it wasn't really my call). My analyzer's tokenStream method:

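    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize only: no stemming and no stop-word removal.
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        // Lowercase so that matching is case-insensitive.
        result = new LowerCaseFilter(result);
        return result;
    }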

Do you really need stemming in your application? Do you really need stop words?

See this note
for a discussion about the advantages/disadvantages of stemming.

If you still want stop words, you can create a list that includes words from more than one
language, then use the same analyzer for all of your content.
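
As a rough sketch of the combined stop word list, assuming plain Java collections rather than any particular Lucene API: the class name CombinedStopWords, its merge method, and the tiny English and German samples are all made up for illustration. The merged array is what you would hand to a stop word filter (for example Lucene's StopFilter, whose constructor varies between versions) inside the one analyzer used for all content.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene.
class CombinedStopWords {
    // Illustrative samples only; real lists would be much longer.
    static final String[] ENGLISH = {"the", "and", "of", "in"};
    static final String[] GERMAN = {"der", "die", "und", "in"};

    // Merge any number of per-language lists into one lowercase,
    // duplicate-free array, preserving first-seen order.
    static String[] merge(String[]... lists) {
        Set<String> merged = new LinkedHashSet<>();
        for (String[] list : lists) {
            for (String word : list) {
                merged.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return merged.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Print the combined list that a single analyzer would filter on.
        System.out.println(Arrays.toString(merge(ENGLISH, GERMAN)));
    }
}
```

One thing to watch with a combined list: a stop word in one language can be a meaningful word in another ("die" is a German article but an English verb), so the merged list may drop terms that some users want to search for.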

If you still need stemming, you will probably have to let the user tell you which language
index they wish to search, and at that point you would probably be better off maintaining
separate indices for each language.
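
If you go the separate-indices route, the lookup from the user's chosen language to an index could look roughly like the sketch below. It is plain Java and does not touch Lucene; the class LanguageIndexLocator, the /indexes/<lang> directory layout, and the fallback to English are all assumptions for illustration. The returned path is where you would open a searcher, using the same language-specific analyzer that built that index.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene.
class LanguageIndexLocator {
    private final Map<String, String> indexPaths = new HashMap<>();
    private final String fallbackLanguage;

    LanguageIndexLocator(String fallbackLanguage) {
        this.fallbackLanguage = normalize(fallbackLanguage);
    }

    // Register the directory that holds the index for one language.
    void register(String language, String indexPath) {
        indexPaths.put(normalize(language), indexPath);
    }

    // Return the index path for the requested language, or the fallback
    // language's path when the requested language has no index.
    String locate(String language) {
        String path = indexPaths.get(normalize(language));
        if (path == null) {
            path = indexPaths.get(fallbackLanguage);
        }
        if (path == null) {
            throw new IllegalStateException("No index registered for " + language);
        }
        return path;
    }

    private static String normalize(String language) {
        return language.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        LanguageIndexLocator locator = new LanguageIndexLocator("en");
        locator.register("en", "/indexes/en"); // assumed layout
        locator.register("de", "/indexes/de"); // assumed layout
        System.out.println(locator.locate("de"));
        System.out.println(locator.locate("fr")); // falls back to English
    }
}
```

A user who wants both a native language and English could call locate once per language and search both indices, each with its own analyzer.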

Best of luck,

-----Original Message-----
From: Günter Kukies [] 
Sent: Thursday, March 06, 2003 2:08 AM
To: Lucene Users List
Subject: Multi Language support


This is what I know about indexing international documents:

1. I have a language ID
2. With this ID I choose a special Analyzer for that language
3. I can use one index for all languages

But what about searching for international documents?

I don't have a language ID, because the user is interested in documents in his native language
and in a second language, mostly English. So which Analyzer do I use for searching?


