lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)
Date Fri, 26 Sep 2008 17:45:46 GMT


Robert Muir commented on LUCENE-1406:

Thought I would add the following comments:

I tried to stick to basics to start. Some things that kept bugging me just for the record:

1) the rules for stemming only require the stemmed token to have 2 characters in many places.
This seems incorrect... triliteral root, anyone? It seems too aggressive. Yet at the same
time, many common prefix/suffix combinations are not stemmed by the light8 algorithm... But
it's TREC-tested...

2) there is no decomposition of unicode presentation forms. These characters show up (typically
when text is extracted out of PDF). The easiest way to deal with this is Unicode normalization,
but that requires Java 6 or ICU.

3) there is no enhanced parsing. Academics typically index high-quality news text, but in other,
less perfect text you often see words run together without spaces where the characters
do not join (to the human reader there is a space!). To really solve this you need a lot of
special stuff, including morphological data, but you can partially solve some of the common
cases by splitting words when you see 100% certain cases such as medial teh marbuta, medial
alef maksura, double alef, ... I didn't do this because I wanted to keep it simple, but it's
important, see here:

4) it is simply a stemmer, but I read in the Lucene docs that it is possible to inject synonym-like
information (multiple tokens for one word) and boost the score for certain ones. Seems like
this would be better than simply stemming, at least indexing and boosting the normalized surface
form for better precision. I'd want to set up TREC tests to actually measure this, though.
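
To illustrate point 2, here is a minimal sketch of folding presentation forms with Unicode
normalization. It assumes Java 6's java.text.Normalizer is available; the class and method
names are hypothetical and not part of the patch. Note that NFKC also folds other
compatibility characters (for example full-width Latin letters), which may or may not be
wanted.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of the LUCENE-1406 patch.
class PresentationFormFolding {

    // NFKC applies compatibility decomposition, which maps Arabic presentation
    // forms (such as the isolated/initial/medial/final glyph variants found in
    // text extracted from PDF) back to the ordinary letters.
    static String fold(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
        String folded = fold("\uFEFB");
        // Expected code points: lam (U+0644) followed by alef (U+0627).
        for (char c : folded.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

In an analyzer this would have to run before the stemmer sees the token; where exactly it
belongs in the chain is left open here.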

> new Arabic Analyzer (Apache license)
> ------------------------------------
>                 Key: LUCENE-1406
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Robert Muir
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-1406.patch
> I've noticed there is no Arabic analyzer for Lucene, most likely because Tim Buckwalter's
> morphological dictionary is GPL.
> However, it is not necessary to have a full morphological analysis engine for quality
> Arabic search.
> This implementation implements the light-8s algorithm presented in the following paper:
> As you can see from the paper, the improvement of this method over searching surface forms
> (as Lucene currently does) is significant, with almost 100% improvement in average precision.
> While I personally don't think all the choices were the best, and some easy improvements
> are still possible, the major motivation for implementing it exactly the way it is presented
> in the paper is that the algorithm is TREC-tested, so the precision/recall improvements to
> Lucene are already documented.
> For a stopword list, I used a list present at
> simply because the creator of this list documents the data as BSD-licensed.
> This implementation (Analyzer) consists of the above-mentioned stopword list plus two filters:
>  ArabicNormalizationFilter: performs orthographic normalization (such as hamza seated
>  on alif, alif maksura, teh marbuta, removal of harakat, tatweel, etc.)
>  ArabicStemFilter: performs Arabic light stemming
> Both filters operate directly on the termbuffer for maximum performance. There is no object
> creation in this Analyzer.
> There are no external dependencies. I've indexed about half a billion words of Arabic
> text and tested against that.
> If there are any issues with this implementation I am willing to fix them. I use Lucene
> on a daily basis and would like to give something back. Thanks.
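
For readers unfamiliar with the orthographic normalization described in the quoted issue,
the following is a rough sketch of what such a step could look like when working in place
on a character buffer. It is not the code from the attached patch: the class name is
hypothetical, and the specific mappings (hamza-seated alif forms to bare alif, alif maksura
to yeh, teh marbuta to heh, dropping harakat and tatweel) are assumptions about the usual
direction of these normalizations.

```java
// Hypothetical sketch, not the ArabicNormalizationFilter from the patch.
class ArabicNormalizationSketch {

    // Normalizes buffer[0..len) in place and returns the new length.
    // No objects are allocated; deleted characters are simply skipped.
    static int normalize(char[] buffer, int len) {
        int out = 0;
        for (int i = 0; i < len; i++) {
            char c = buffer[i];
            switch (c) {
                case '\u0622': // alef with madda above
                case '\u0623': // alef with hamza above
                case '\u0625': // alef with hamza below
                    buffer[out++] = '\u0627'; // assumed target: bare alef
                    break;
                case '\u0649': // alef maksura
                    buffer[out++] = '\u064A'; // assumed target: yeh
                    break;
                case '\u0629': // teh marbuta
                    buffer[out++] = '\u0647'; // assumed target: heh
                    break;
                case '\u0640': // tatweel (kashida): dropped
                    break;
                default:
                    if (c >= '\u064B' && c <= '\u0652') {
                        break; // harakat (fathatan through sukun): dropped
                    }
                    buffer[out++] = c;
            }
        }
        return out;
    }

    // Convenience wrapper for experimenting with whole strings.
    static String normalize(String s) {
        char[] buffer = s.toCharArray();
        int len = normalize(buffer, buffer.length);
        return new String(buffer, 0, len);
    }
}
```

The in-place loop mirrors the "operate directly on the termbuffer" idea from the
description: the write index never overtakes the read index, so no second buffer is needed.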

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
