lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pierrick Brihaye" <pierrick.brih...@wanadoo.fr>
Subject Announce : arabic Stemmer/Analyzer for Lucene
Date Sun, 28 Sep 2003 07:49:14 GMT
Hi all,

I have written a Lucene Analyzer for arabic. You will find it here :
http://perso.wanadoo.fr/pierrick.brihaye/ArabicAnalyzer.jar (provisional
adress, anybody interested in hosting it ?)

This work is still in beta stage but it gives quite good results :-)

In order to make it work, you need :

1) a 1.4+ JVM (because of the native support for regular expressions which
are heavily used in the program ; I've been too lazy to use an external
package)

2) Apache Jakarta Commons-Collections :
http://jakarta.apache.org/commons/collections.html

3) a recent Lucene distribution ;-)

All this work is based on the amazing Tim Buckwalter's Arabic Morphological
Analyzer Version 1.0
(http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49)
originaly written in Perl and released under the GPL.

The jar contains :

a) the compiled classes
b) the required data files (dictionaries and compatibility tables)
c) 2 command-line test programs
d) 3 test documents with different encodings
e) the source code
f) a README file that will give you a little bit more of information :-)

To Lucene developers : I plan to offer this work to Lucene (see the jar
hierarchy... and the source file headers ;-). Any objections ?

Feedback is very welcome : there are quite a lot of unresolved issues, with
the analyzer itselfs as well as with Lucene.

mE AlslAmap, cheers,

p.b.






Mime
View raw message