lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gusenbauer Stefan <gusenba...@eduhi.at>
Subject Re: Multiple Language Indexing and Searching
Date Tue, 06 Sep 2005 12:14:05 GMT
James Adams wrote:

>Does anyone know what approach does Nutch uses?
>
>
>-----Original Message-----
>From: Hacking Bear [mailto:hackingbear@gmail.com] 
>Sent: 06 September 2005 12:15
>To: java-user@lucene.apache.org
>Subject: Re: Multiple Language Indexing and Searching
>
>On 9/6/05, Olivier Jaquemet <olivier.jaquemet@jalios.com> wrote: 
>  
>
>>As far as your usage is concerned, it seems to be the right approach,
>>and I think the StandardAnalyzer does the job pretty right when it has
>>to deal with whatever language you want.
>>    
>>
>
> I should look into exactly what it does. Does this StandardAnalyzer
>handle 
>non-European languages like Chinese?
>
>Though, note that it won't deal with all languages' stop words but the
>  
>
>>English ones, unless specified at index time But then if you change
>>    
>>
>the
>  
>
>>stop words at index time, what should you use at query time, some
>>    
>>
>query
>  
>
>>it won't work well.
>>    
>>
>
> I think we can easily create our own super stop-word lists by copying
>from 
>whatever other language's stop word lists we can find.
>
>But as far as I am concerned, each content (content in the sense of a
>  
>
>>CMS) is known to have multiple language, and each of these language
>>*can* be indexed separately with no problem at all, and therefore a
>>dedicated analyser could be use. So I was wondering whether my
>>    
>>
>approach
>  
>
>>could be the right one of if it was over complex, and could introduce
>>some problem I could not see... (My approach being: one index per 
>>language)
>>    
>>
>
> My suggestion would be to create one index for all languages with each 
>document having a 'lang' attribute. Lucene is quite scalable right? So
>this 
>should not be an issue.
> During search, you can either default to turn on the 'lang' attribute 
>condition or default to off, depending on what your users want most
>often. 
>But it will be very easy to search multiple language documents.
> 
>  
>
>>I don't know if the developpers of lucene would agree, but from what
>>I've been browsing on the ML archives, those multiple language issues
>>seems to arrise quite often in the mailing list, and maybe some
>>    
>>
>articles
>  
>
>>like "best practices", "do's and don'ts" or "Lucene Architecture in
>>multiple language environement", might be really nice to see :) If
>>    
>>
>some
>  
>
>>of you have the time and the experience to write them I'll be really
>>thankful! :)
>>    
>>
>
> What keywords do you use to search? Somehow, I cannot find any
>discussion 
>about multiple language on the ML archive. I even did Google! :-) Or
>maybe I 
>was giving the keywords in the wrong language? :-)
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>  
>
I think nutch uses ngramj for language classification but i don't know
what type of saving language information they use. In our application
for example i save the language in an extra field in the document
because lucene is supporting multiple fields with the same names we
would be able to handle different languages. but for now we don't need it

greets
stefan


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message