lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hacking Bear <hackingb...@gmail.com>
Subject Re: Multiple Language Indexing and Searching
Date Tue, 06 Sep 2005 11:15:11 GMT
On 9/6/05, Olivier Jaquemet <olivier.jaquemet@jalios.com> wrote: 
> 
> As far as your usage is concerned, it seems to be the right approach,
> and I think the StandardAnalyzer does the job pretty right when it has
> to deal with whatever language you want.

 I should look into exactly what it does. Does this StandardAnalyzer handle 
non-European languages like Chinese?

Though, note that it won't deal with all languages' stop words but the
> English ones, unless specified at index time But then if you change the
> stop words at index time, what should you use at query time, some query
> it won't work well.

 I think we can easily create our own super stop-word lists by copying from 
whatever other language's stop word lists we can find.

But as far as I am concerned, each content (content in the sense of a
> CMS) is known to have multiple language, and each of these language
> *can* be indexed separately with no problem at all, and therefore a
> dedicated analyser could be use. So I was wondering whether my approach
> could be the right one of if it was over complex, and could introduce
> some problem I could not see... (My approach being: one index per 
> language)

 My suggestion would be to create one index for all languages with each 
document having a 'lang' attribute. Lucene is quite scalable right? So this 
should not be an issue.
 During search, you can either default to turn on the 'lang' attribute 
condition or default to off, depending on what your users want most often. 
But it will be very easy to search multiple language documents.
 
> I don't know if the developpers of lucene would agree, but from what
> I've been browsing on the ML archives, those multiple language issues
> seems to arrise quite often in the mailing list, and maybe some articles
> like "best practices", "do's and don'ts" or "Lucene Architecture in
> multiple language environement", might be really nice to see :) If some
> of you have the time and the experience to write them I'll be really
> thankful! :)

 What keywords do you use to search? Somehow, I cannot find any discussion 
about multiple language on the ML archive. I even did Google! :-) Or maybe I 
was giving the keywords in the wrong language? :-)

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message