lucene-dev mailing list archives

From Ahmed Al-Obaidy <ahmad_aloba...@yahoo.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 13:18:56 GMT
There is no upper and lower case in Arabic.
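A minimal sketch of that point in plain Java, with no Lucene dependency: case mapping leaves Arabic letters unchanged, so lowercasing would presumably only affect non-Arabic (e.g. Latin) tokens mixed into the text. The class and method names here are hypothetical, chosen for illustration.

```java
import java.util.Locale;

class ArabicCaseCheck {
    /** Returns true if neither lower- nor upper-casing changes the text. */
    static boolean isCaseless(String text) {
        return text.equals(text.toLowerCase(Locale.ROOT))
            && text.equals(text.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Arabic word "kitab" (book): kaf, teh, alef, beh
        String arabic = "\u0643\u062A\u0627\u0628";
        // The same word followed by a Latin-script word
        String mixed = arabic + " Lucene";

        System.out.println(isCaseless(arabic)); // prints "true"
        System.out.println(isCaseless(mixed));  // prints "false"
    }
}
```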

--- On Thu, 10/8/09, DM Smith <dmsmith555@gmail.com> wrote:

From: DM Smith <dmsmith555@gmail.com>
Subject: Re: Arabic Analyzer: possible bug
To: java-dev@lucene.apache.org
Date: Thursday, October 8, 2009, 3:14 PM

Robert,

Thanks for the info.

As I said, I am illiterate in Arabic. So I have another, perhaps nonsensical, question:
Does the stop word list have every combination of upper/lower case for each Arabic word
in the list? (i.e. is it fully de-normalized?) Or should it come after LowerCaseFilter?

-- DM

On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
DM, this isn't a bug.

The Arabic stopwords are not normalized.

But for Persian, I normalized the stopwords, mostly because I did not want to have to create
variations with Farsi yah versus Arabic yah for each one.
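
A minimal sketch of why the filter order has to match the form of the stop list, in plain Java rather than against the Lucene API. Everything here is a hypothetical stand-in: `normalize` folds only Farsi yeh (U+06CC) to Arabic yeh (U+064A), which is a small subset of what a real normalization filter does, and `removeStops` is a plain set lookup. The stop list is assumed to be stored in normalized form, as described for Persian.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopOrderSketch {
    static final char FARSI_YEH = '\u06CC';
    static final char ARABIC_YEH = '\u064A';

    // Toy normalizer (assumption): folds Farsi yeh to Arabic yeh only.
    static String normalize(String token) {
        return token.replace(FARSI_YEH, ARABIC_YEH);
    }

    static List<String> normalizeAll(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(normalize(t));
        }
        return out;
    }

    // Toy stop filter (assumption): drops tokens that exactly match a stop word.
    static List<String> removeStops(List<String> tokens, Set<String> stops) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stops.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Stop filter first: only correct if the stop list lists raw variants.
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stops) {
        return normalizeAll(removeStops(tokens, stops));
    }

    // Normalization first: correct when the stop list itself is normalized.
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stops) {
        return removeStops(normalizeAll(tokens), stops);
    }

    public static void main(String[] args) {
        // Normalized stop list: Persian "in" (this), spelled with Arabic yeh.
        Set<String> stops = new HashSet<>(Arrays.asList("\u0627\u064A\u0646"));
        // Input: the same word spelled with Farsi yeh, plus a content word.
        List<String> tokens = Arrays.asList("\u0627\u06CC\u0646", "\u0643\u062A\u0627\u0628");

        System.out.println(stopThenNormalize(tokens, stops).size()); // prints "2": stop word leaks through
        System.out.println(normalizeThenStop(tokens, stops).size()); // prints "1": stop word removed
    }
}
```

With an unnormalized stop list the reverse holds: the stop filter has to see the raw tokens, so it runs before normalization.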



On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith555@gmail.com> wrote:


I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know Arabic or Farsi,
but I have some texts to index in those languages.)
The tokenizer/filter chain for ArabicAnalyzer is:

        TokenStream result = new ArabicLetterTokenizer( reader );
        result = new StopFilter( result, stoptable );
        result = new LowerCaseFilter(result);
        result = new ArabicNormalizationFilter( result );
        result = new ArabicStemFilter( result );

        return result;

Shouldn't the StopFilter come after ArabicNormalizationFilter?

As a comparison, the PersianAnalyzer has:

    TokenStream result = new ArabicLetterTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new ArabicNormalizationFilter(result);
    /* additional persian-specific normalization */
    result = new PersianNormalizationFilter(result);
    /*
     * the order here is important: the stopword list is normalized with the
     * above!
     */
    result = new StopFilter(result, stoptable);

    return result;


Thanks,

	DM


-- 
Robert Muir
rcmuir@gmail.com







      