lucene-java-user mailing list archives

From "Grant Ingersoll" <>
Subject Re: Arabic analyzer
Date Thu, 07 Oct 2004 16:11:35 GMT
Someone posted an Arabic analyzer about a year ago. However, I don't
think the licensing was very friendly, and we no longer use it.

We have a cross-language system that works with Arabic (among other
languages).  We have written several stemmers based on the literature
that perform pretty well and were not too difficult to implement (but
they are not available as open source at this point).  Light stemming
seems to work much better in IR applications than aggressive stemming,
due to the problems with roots discussed earlier.
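
To make the light-stemming idea concrete, here is a minimal sketch in
Java. It is not the stemmer described above (that code is not
available); the class name LightStemmerSketch, the affix lists, and the
minimum stem length of 3 are all assumptions chosen for illustration.
The idea is only to strip at most one common prefix and one common
suffix, without trying to reduce the word to its root.

```java
final class LightStemmerSketch {
    // Hypothetical affix lists for illustration only. Longer prefixes come
    // first so that "wal" is tried before the bare article "al".
    private static final String[] PREFIXES = {
        "\u0648\u0627\u0644", // waw + alef + lam ("and the")
        "\u0628\u0627\u0644", // beh + alef + lam ("with the")
        "\u0643\u0627\u0644", // kaf + alef + lam ("like the")
        "\u0641\u0627\u0644", // feh + alef + lam ("so the")
        "\u0627\u0644",       // alef + lam ("the")
    };
    private static final String[] SUFFIXES = {
        "\u0647\u0627", // heh + alef
        "\u0627\u0646", // alef + noon
        "\u0627\u062A", // alef + teh
        "\u0648\u0646", // waw + noon
        "\u064A\u0646", // yeh + noon
        "\u0629",       // teh marbuta
    };
    // Assumed guard: never strip an affix if fewer than 3 letters would remain.
    private static final int MIN_STEM_LENGTH = 3;

    /** Strips at most one prefix and then at most one suffix. */
    static String stem(String word) {
        String withoutPrefix = stripFirstMatch(word, PREFIXES, true);
        return stripFirstMatch(withoutPrefix, SUFFIXES, false);
    }

    private static String stripFirstMatch(String s, String[] affixes, boolean prefix) {
        for (String a : affixes) {
            boolean matches = prefix ? s.startsWith(a) : s.endsWith(a);
            if (matches && s.length() - a.length() >= MIN_STEM_LENGTH) {
                return prefix ? s.substring(a.length())
                              : s.substring(0, s.length() - a.length());
            }
        }
        return s; // no affix removed
    }

    public static void main(String[] args) {
        // "al-kitab" (the book) is reduced to "kitab" (book).
        System.out.println(stem("\u0627\u0644\u0643\u062A\u0627\u0628"));
    }
}
```

A root-based stemmer would go further and map "kitab" (book) and
"maktaba" (library) to the same root k-t-b; the sketch deliberately
stops at the affixes, so such words stay distinct in the index.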


Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies 

>>> 10/7/2004 8:45:42 AM >>>
Dawid Weiss wrote:
>> nothing to do with each other. Furthermore, Arabic uses phonetic
>> indicators on each letter called diacritics that change the way you
>> pronounce the word, which in turn changes the word's meaning, so two
>> words spelled exactly the same way with different diacritics will
>> mean two separate things,
> Just to point out the fact: most Slavic languages also use diacritic
> marks (above, like 'acute', or 'dot' marks, or below, like the Polish
> 'ogonek' mark). Some people argue that they can be stripped off
> upon indexing and that the queries usually disambiguate the context
> of the word.
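
For anyone who wants to try the "strip diacritics upon indexing"
approach, here is a rough Java sketch. The class and method names are
made up for this example, and the Arabic range used (U+064B through
U+0652, the short-vowel and related marks) is an assumption about which
marks one would want to drop. The Latin variant relies on
java.text.Normalizer to decompose letters and then removes the
combining marks.

```java
import java.text.Normalizer;

final class DiacriticFoldingSketch {
    /** Removes Arabic marks in the assumed range U+064B (fathatan) to U+0652 (sukun). */
    static String stripArabicDiacritics(String text) {
        return text.replaceAll("[\u064B-\u0652]", "");
    }

    /** Decomposes to NFD and removes combining marks (acute, dot above, ogonek, ...). */
    static String stripLatinDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        // "kataba" (he wrote) and "kutub" (books) differ only in their diacritics.
        String kataba = "\u0643\u064E\u062A\u064E\u0628\u064E";
        String kutub = "\u0643\u064F\u062A\u064F\u0628";
        // After stripping, both become the same three-letter index term.
        System.out.println(stripArabicDiacritics(kataba).equals(stripArabicDiacritics(kutub)));
    }
}
```

This shows the trade-off from the quoted text: after stripping, two
differently vocalized words become the same index term, and it is left
to the rest of the query to disambiguate. Note also that the Latin
variant leaves the Polish l-with-stroke (U+0142) untouched, because that
letter has no Unicode decomposition.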

Hmm. This brings up a question: the algorithmic stemmer package from
Egothor works quite well for Polish; wouldn't it work well for Arabic,
too?

I lack the necessary expertise to evaluate results (knowing only two or
three Arabic words ;-) ), but I can certainly help someone to get
started with testing...

Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer

To unsubscribe, e-mail: 
For additional commands, e-mail: 
