lucene-java-user mailing list archives

From Mike Baranczak <mbara...@twcny.rr.com>
Subject Re: Multi-analyzer ?
Date Mon, 11 Apr 2005 14:55:33 GMT
Your example with Arabic wouldn't work reliably either - there are 
several other languages that use the Arabic script (Persian for 
example).
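
To illustrate the point, here is a minimal sketch of script detection. It assumes a Java 7+ runtime, since it relies on Character.UnicodeScript; the class and method names (ScriptGuess, dominantScript) are made up for this example and are not part of Lucene. It tallies the script of each letter and reports the most frequent one:

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper, not a Lucene class.
class ScriptGuess {

    /** Returns the script used by most letters in the text, or null if it has no letters. */
    static UnicodeScript dominantScript(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        // Count letters only, so digits, spaces and punctuation do not skew the tally.
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));

        UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // An Arabic greeting ("hello world"), written with Unicode escapes.
        String arabic = "\u0645\u0631\u062D\u0628\u0627 \u0628\u0627\u0644\u0639\u0627\u0644\u0645";
        // A Persian greeting ("hello world"), including the Persian letter U+06CC.
        String persian = "\u0633\u0644\u0627\u0645 \u062F\u0646\u06CC\u0627";
        System.out.println(dominantScript(arabic));
        System.out.println(dominantScript(persian));
    }
}
```

Both calls report the ARABIC script, so a check like this narrows the candidates down to a script but cannot tell Arabic from Persian.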

You could also try to pick out characters that are unique to a 
particular language - for example, Ę or Ż only occur in Polish (as far 
as I know...). Of course, you have no guarantee that a Polish-language 
query will actually contain any of those characters - so this method 
would only work as a supplement to another method.
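
A rough sketch of such a supplementary check follows. The class name, method name and hint set are assumptions chosen for illustration; the hint set deliberately leaves out ó because it is common in other languages. Even the listed letters are not strictly exclusive to Polish (Ę also appears in Lithuanian and Ż in Maltese), which is one more reason to treat the result as a hint only:

```java
import java.util.Locale;

// Hypothetical helper, not a Lucene class.
class PolishHint {

    // Assumed hint set: lowercase a-ogonek, c-acute, e-ogonek, l-stroke,
    // n-acute, s-acute, z-acute, z-dot-above.
    private static final String HINT_CHARS =
            "\u0105\u0107\u0119\u0142\u0144\u015B\u017A\u017C";

    /** Counts characters in the query that suggest (but do not prove) Polish. */
    static int countHints(String query) {
        // Lowercase first so that uppercase forms match the hint set too.
        String lower = query.toLowerCase(Locale.ROOT);
        int hints = 0;
        for (int i = 0; i < lower.length(); i++) {
            if (HINT_CHARS.indexOf(lower.charAt(i)) >= 0) {
                hints++;
            }
        }
        return hints;
    }
}
```

A count above zero can raise the score for Polish, but a count of zero says nothing: a query such as "dom" (Polish for "house") contains no hint characters at all.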

And don't forget that some words are written the same in several 
different languages.

This is the sort of problem that the end user can solve much better 
than the software can.

-MB


On Apr 11, 2005, at 6:02 AM, Andy Roberts wrote:

> Can you not provide the user with an option list to specify their input
> language?
>
> Language identification can be a pretty tricky field. There are some
> tricks you can do with Unicode to identify language, e.g., \u0600 - \u06FF
> contains the Arabic characters, so if your input contains lots of chars
> within this range, you can guess that the input is Arabic, for example.
>
> The problem comes with differentiating between the languages that use a
> Latin alphabet. Again, there are multiple approaches, although the only
> one I know of that worked pretty well for identifying European languages
> was to build a model based on character bigrams (that is, sequences of
> two letters) [1].
>
> At the end of the day, Lucene cannot help you in choosing the correct
> language as it doesn't know, and so it'll be up to you to add the
> necessary logic to tell Lucene which Analyzers to utilise. :(
>
> Andy
>
> [1] Churcher, G. E.; Hayes, J.; Hughes, J. S.; Johnson, S.; Souter, C.
> Bigram and trigram models for language identification and
> classification. In: Evett, L. & Rose, T. (editors), Computational
> Linguistics for Speech and Handwriting Recognition, AISB'94 Workshop,
> University of Leeds/AISB, 1994.
>
> On Monday 11 Apr 2005 01:21, Eric Chow wrote:
>> Hello,
>>
>> If I don't know the language of the input terms, how can I use
>> different analyzers to search them?
>>
>> For example, the input box accepts UTF-8 search text, which can be
>> anything, such as Chinese, Japanese, English, Russian, German, etc. How
>> can I search any of them or all of them with Lucene?
>>
>> Any example, please?
>>
>>
>> Best Regards,
>> Eric
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org

