lucene-java-user mailing list archives

From Andrzej Bialecki
Subject Re: Arabic analyzer
Date Thu, 07 Oct 2004 12:45:42 GMT
Dawid Weiss wrote:
>> nothing to do with each other. Furthermore, Arabic uses phonetic 
>> indicators on each letter, called diacritics, that change the way you 
>> pronounce the word, which in turn changes the word's meaning, so two words 
>> spelled exactly the same way with different diacritics will mean two 
>> separate things, 
> Just to point out the fact: most Slavic languages also use diacritic 
> marks (above, like 'acute', or 'dot' marks, or below, like the Polish 
> 'ogonek' mark). Some people argue that they can be stripped off the text 
> upon indexing and that the queries usually disambiguate the context of 
> the word.

Hmm. This brings up a question: the algorithmic stemmer package from 
Egothor works quite well for Polish. Wouldn't it work well for Arabic, 
too?

I lack the necessary expertise to evaluate results (knowing only two or 
three Arabic words ;-) ), but I can certainly help someone to get 
started with testing...
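
For anyone who wants to experiment with the "strip diacritics upon 
indexing" idea quoted above, here is a minimal sketch. It is not part of 
Lucene or Egothor: the class name DiacriticStripper is hypothetical, and 
it assumes a JDK that ships java.text.Normalizer (Java 6 or later). It 
decomposes the text and drops every non-spacing combining mark, which 
covers both the Polish marks and the Arabic short-vowel signs.

```java
import java.text.Normalizer;

// Hypothetical helper for experiments; not a Lucene or Egothor class.
final class DiacriticStripper {

    private DiacriticStripper() {}

    // Decomposes the input canonically (NFD), then removes all characters
    // in Unicode category Mn (non-spacing marks), e.g. acute, ogonek,
    // and the Arabic fatha/damma/kasra signs.
    static String strip(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }
}
```

Two caveats worth checking before relying on this. The Polish letter 
"ł" has no canonical decomposition, so it passes through unchanged. And 
some Arabic letters such as alef with hamza do decompose, so their hamza 
or madda is removed along with the vowel marks, which may or may not be 
what an Arabic speaker would want.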

Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer
