lucene-java-user mailing list archives

From "James Adams" <>
Subject RE: Multiple Language Indexing and Searching
Date Tue, 06 Sep 2005 11:39:50 GMT

Does anyone know what approach Nutch uses?

-----Original Message-----
From: Hacking Bear [] 
Sent: 06 September 2005 12:15
Subject: Re: Multiple Language Indexing and Searching

On 9/6/05, Olivier Jaquemet <> wrote: 
> As far as your usage is concerned, it seems to be the right approach,
> and I think the StandardAnalyzer does the job pretty right when it has
> to deal with whatever language you want.

 I should look into exactly what it does. Does this StandardAnalyzer
handle non-European languages like Chinese?

> Though, note that it won't deal with all languages' stop words but the
> English ones, unless specified at index time. But then if you change
> stop words at index time, what should you use at query time, some
> it won't work well.

 I think we can easily create our own super stop-word lists by copying
whatever other language's stop word lists we can find.
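
A minimal sketch of that merged stop-word idea, in plain Java with no Lucene classes. The class name MergedStopWords and the three tiny word lists are made up for illustration; real lists would be far longer, and how the merged set gets handed to an analyzer is not shown here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class MergedStopWords {
    // Tiny illustrative samples, not complete stop-word lists.
    static final List<String> ENGLISH = Arrays.asList("the", "and", "of");
    static final List<String> FRENCH = Arrays.asList("le", "la", "et", "de");
    static final List<String> GERMAN = Arrays.asList("der", "die", "und");

    /** Union of several per-language lists, lower-cased, duplicates removed. */
    @SafeVarargs
    static Set<String> merge(List<String>... lists) {
        Set<String> merged = new LinkedHashSet<>();
        for (List<String> list : lists) {
            for (String word : list) {
                merged.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return Collections.unmodifiableSet(merged);
    }

    public static void main(String[] args) {
        Set<String> stopWords = merge(ENGLISH, FRENCH, GERMAN);
        System.out.println(stopWords.size() + " stop words: " + stopWords);
    }
}
```

One thing to watch with a merged list: a stop word in one language can be a meaningful word in another. With the sample above, the German article "die" would also be dropped from English text.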

> But as far as I am concerned, each content (content in the sense of a
> CMS) is known to have multiple languages, and each of these languages
> *can* be indexed separately with no problem at all, and therefore a
> dedicated analyser could be used. So I was wondering whether my approach
> could be the right one or if it was over complex, and could introduce
> some problem I could not see... (My approach being: one index per
> language)

 My suggestion would be to create one index for all languages with each 
document having a 'lang' attribute. Lucene is quite scalable, right? So
that should not be an issue.
 During search, you can default to having the 'lang' attribute condition
either on or off, depending on what your users want most.
But it will be very easy to search documents in multiple languages.

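
To make the single-index idea concrete, here is a rough sketch in plain Java. It is a toy in-memory list scanned linearly, not Lucene's API; the class LangFilteredSearch, the Doc type, and the convention of passing null to switch the 'lang' condition off are all assumptions made for illustration. In Lucene itself the same effect would presumably come from adding a required clause on the 'lang' field to the user's query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LangFilteredSearch {
    /** A stored document: its text plus the 'lang' attribute. */
    static final class Doc {
        final String lang;
        final String text;
        Doc(String lang, String text) {
            this.lang = lang;
            this.text = text;
        }
    }

    // One shared collection for documents in every language.
    private final List<Doc> docs = new ArrayList<>();

    void add(String lang, String text) {
        docs.add(new Doc(lang, text));
    }

    /**
     * Returns documents whose text contains the term (case-insensitive).
     * Pass lang == null to turn the 'lang' condition off and search all languages.
     */
    List<Doc> search(String term, String lang) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            boolean langOk = (lang == null) || lang.equals(d.lang);
            if (langOk && d.text.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        LangFilteredSearch index = new LangFilteredSearch();
        index.add("en", "Lucene is a search library");
        index.add("fr", "Lucene est une bibliotheque de recherche");
        System.out.println(index.search("lucene", null).size()); // 2: filter off
        System.out.println(index.search("lucene", "fr").size()); // 1: French only
    }
}
```

Whether the filter defaults to on or off is then just a question of which value the search form passes in.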
> I don't know if the developers of Lucene would agree, but from what
> I've been browsing on the ML archives, those multiple language issues
> seem to arise quite often in the mailing list, and maybe some
> like "best practices", "do's and don'ts" or "Lucene Architecture in
> multiple language environment", might be really nice to see :) If
> of you have the time and the experience to write them I'll be really
> thankful! :)

 What keywords do you use to search? Somehow, I cannot find anything
about multiple languages on the ML archive. I even tried Google! :-) Or
maybe I was giving the keywords in the wrong language? :-)
