lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ernesto De Santis <ernesto.desan...@colaborativa.net>
Subject Re: Multi-analyzer ?
Date Tue, 12 Apr 2005 12:12:47 GMT
Maybe you can use PerFieldAnalyzerWrapper.
(I never used this)

Ernesto.

Eric Chow escribi├│:

>But how about one document contains more than two different languages ??
>
>
>Eric
>
>On Apr 12, 2005 12:13 AM, Andy Roberts <mail@andy-roberts.net> wrote:
>  
>
>>On Monday 11 Apr 2005 14:55, Mike Baranczak wrote:
>>    
>>
>>>Your example with Arabic wouldn't work reliably either - there are
>>>several other languages that use the Arabic script (Persian for
>>>example).
>>>      
>>>
>>Good point. Although you could try a simple approach to test for the
>>additional characters that exist in Persian but not in Arabic. Although, this
>>again is not fool-proof. A letter-model approach would be better but is
>>rather time consuming.
>>
>>    
>>
>>>This is the sort of problem that the end user can solve much better
>>>than the software can.
>>>
>>>      
>>>
>>I completely agree, which is why I originally suggested prompting the user for
>>this info. It may be the case that for the majority of queries, English is
>>the usual language. And it is probably more feasible to do a test to
>>determine whether the query English or not (still very tricky, mind). If not,
>>then prompt the user to specify their input language because otherwise,
>>results will be poor.
>>
>>Andy Roberts
>>
>>    
>>
>>>-MB
>>>
>>>On Apr 11, 2005, at 6:02 AM, Andy Roberts wrote:
>>>      
>>>
>>>>Can you not provide the user with a option list to specify their input
>>>>language?
>>>>
>>>>Language identification can be a pretty tricky field. There are some
>>>>tricks
>>>>you can do with unicode to identify language, e.g., \u0600 - \u06FF
>>>>contains
>>>>the Arabic characters, so if you're input contains lots of chars
>>>>within this
>>>>range, you can guess that the input is Arabic, for example.
>>>>
>>>>The problem comes with differentiating between the languages that use
>>>>a Latin
>>>>alphabet. Again, there are multiple approaches, although the only one
>>>>I know
>>>>of that worked pretty well for identifying European languages was to
>>>>build a
>>>>model based on character bigrams (that is, sequences of two letters)
>>>>[1]
>>>>
>>>>At the end of the day, Lucene cannot help you in choosing the correct
>>>>language
>>>>as it doesn't know, and so it'll be up to you to add the necessary
>>>>logic to
>>>>tell Lucene which Analyzers to utilise. :(
>>>>
>>>>Andy
>>>>
>>>>[1] Churcher, G E; Hayes, J; Hughes, J S; Johnson, S; Souter, C.
>>>>Bigram and
>>>>trigram models for language identification and classification in:
>>>>Evett, L &
>>>>Rose,T (editors) Computational Linguistics for Speech and Handwriting
>>>>Recognition AISB'94 Workshop University of Leeds/AISB. 1994.
>>>>
>>>>On Monday 11 Apr 2005 01:21, Eric Chow wrote:
>>>>        
>>>>
>>>>>Hello,
>>>>>
>>>>>If I don't know the language of the input terms, how can I use
>>>>>different analyzer to search it ?
>>>>>
>>>>>For example, the input box accepts UTF-8 search text, they can be
>>>>>anything, such as Chinese, Japanese, English, Russian, Deuch, etc. How
>>>>>can search any of them or all of them with Lucene?
>>>>>
>>>>>Any example, please?
>>>>>
>>>>>
>>>>>Best Regards,
>>>>>Eric
>>>>>
>>>>>---------------------------------------------------------------------
>>>>>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>          
>>>>>
>>>>---------------------------------------------------------------------
>>>>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>        
>>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>For additional commands, e-mail: java-user-help@lucene.apache.org
>>>      
>>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>    
>>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>  
>

-- 
Ernesto De Santis - Colaborativa.net
C├│rdoba 1147 Piso 6 Oficinas 3 y 4
(S2000AWO) Rosario, SF, Argentina.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message