lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Arabic analyzer
Date Thu, 07 Oct 2004 12:45:42 GMT
Dawid Weiss wrote:
> 
>> nothing to do with each other. Furthermore, Arabic uses phonetic 
>> indicators on each letter, called diacritics, that change the way you 
>> pronounce the word, which in turn changes the word's meaning, so two words 
>> spelled exactly the same way with different diacritics will mean two 
>> separate things, 
> 
> 
> Just to point out the fact: most Slavic languages also use diacritic 
> marks (above, like 'acute', or 'dot' marks, or below, like the Polish 
> 'ogonek' mark). Some people argue that they can be stripped off the text 
> upon indexing and that the queries usually disambiguate the context of 
> the word.
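
For illustration, the stripping-at-index-time idea mentioned above could be 
sketched roughly as follows. This is only a minimal sketch, not anything from 
Lucene or Egothor: the class name DiacriticStripper and its strip() method 
are made up for this example, and it assumes that decomposing the text (NFD) 
and dropping all Unicode nonspacing marks is an acceptable approximation of 
"stripping diacritics".

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of Lucene or Egothor).
class DiacriticStripper {
    // Unicode nonspacing marks (category Mn): the combining accents that
    // NFD decomposition produces, and also the Arabic short-vowel marks.
    private static final Pattern MARKS = Pattern.compile("\\p{Mn}+");

    static String strip(String text) {
        // Split precomposed letters into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop the marks and keep the base letters.
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // Polish "sad" written with a-ogonek (U+0105) in the middle.
        System.out.println(strip("s\u0105d"));
        // Polish word containing l-with-stroke (U+0142), which has no
        // canonical decomposition and therefore passes through unchanged.
        System.out.println(strip("\u017C\u00F3\u0142w"));
        // Arabic kaf, teh, beh, each followed by a fatha (U+064E).
        System.out.println(strip("\u0643\u064E\u062A\u064E\u0628\u064E"));
    }
}
```

With this approach the Polish words for "court" (spelled with a-ogonek) and 
"orchard" ("sad") end up as the same indexed term, which is exactly the 
ambiguity that the query context would then have to resolve.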

Hmm. This brings up a question: the algorithmic stemmer package from 
Egothor works quite well for Polish (http://www.getopt.org/stempel). 
Wouldn't it work well for Arabic, too?

I lack the necessary expertise to evaluate results (knowing only two or 
three Arabic words ;-) ), but I can certainly help someone to get 
started with testing...

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



