Robert,
Yes, it is tricky.

I'm not suggesting that the ArabicAnalyzer have any stopwords other than Arabic ones.

I'm suggesting that if I know my input document well, and know that it contains mixed text in Arabic and one other known language, I might want to augment the stop list with stop words appropriate for that language. In that case, I think the stop filter should come after the lower case filter.
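
To make the ordering concern concrete, here is a minimal stand-alone sketch. It does not use Lucene: plain collections stand in for StopFilter and LowerCaseFilter, and the class name, the stop list, and the sample tokens are all made up for illustration. It assumes the stop set holds only lower-case entries and matches tokens exactly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Stand-alone sketch: plain collections stand in for Lucene's StopFilter and LowerCaseFilter.
final class StopOrderSketch {

    // Hypothetical mixed stop list: one Arabic stop word plus two lower-case English ones.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("\u0645\u0646", "the", "of"));

    // Drops every token that appears verbatim in the stop set (exact, case-sensitive match).
    static List<String> stop(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Lower-cases every token; Arabic letters have no case and pass through unchanged.
    static List<String> lower(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("The", "\u0643\u062A\u0627\u0628", "\u0645\u0646", "Lucene");
        // Stop first, then lower-case: "The" is not in the stop set, so it survives as "the".
        System.out.println(lower(stop(tokens)));
        // Lower-case first, then stop: "the" is now matched and removed.
        System.out.println(stop(lower(tokens)));
    }
}
```

With the stop step first, the capitalized English stop word slips through and only becomes "the" afterwards. With the lower-case step first, both the Arabic and the English stop words are removed.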

As to lower casing across the board, I also think it is pretty safe, but there are some edge cases. For example, lower casing a Greek word written in all upper case and ending in sigma will not produce the same result as the same word already written in lower case: the properly written lower case word ends in a final sigma (ς), while lower casing the upper case form yields a small sigma (σ). For Greek, using an UpperCaseFilter followed by a LowerCaseFilter would handle this case.
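
The sigma case can be sketched in plain Java. This assumes a filter that converts case one character at a time with Character.toLowerCase / Character.toUpperCase; the class name, helper names, and the unaccented sample word are illustrative only and are not Lucene code.

```java
// Stand-alone sketch of the final-sigma edge case, using per-character case mapping.
final class SigmaFoldSketch {

    // Lower-cases one char at a time (no context, so no final-sigma rule is applied).
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Upper-cases one char at a time; both sigma forms map to capital sigma.
    static String upperPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toUpperCase(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = "\u039F\u0394\u039F\u03A3"; // all capitals, ends in capital sigma U+03A3
        String lower = "\u03BF\u03B4\u03BF\u03C2"; // lower case, ends in final sigma U+03C2

        // Lower-casing alone leaves two different tokens (small sigma vs. final sigma).
        System.out.println(lowerPerChar(upper).equals(lowerPerChar(lower)));

        // Upper-casing first and then lower-casing folds both spellings to the same token.
        System.out.println(
            lowerPerChar(upperPerChar(upper)).equals(lowerPerChar(upperPerChar(lower))));
    }
}
```

The upper-then-lower round trip works here because the final sigma upper-cases to the same capital sigma as the ordinary small sigma, so both spellings end up with U+03C3.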

IMHO, this is not an issue for the Arabic or Persian analyzers.

-- DM

On 10/08/2009 09:36 AM, Robert Muir wrote:
DM, I suppose, but this is a tricky subject. What if you have mixed Arabic / German or something like that?

For some other languages written in the Latin script, English stopwords could be bad :)

I think that lowercasing non-Arabic text (also Cyrillic, etc.) is pretty safe across the board though.

On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <dmsmith555@gmail.com> wrote:
On 10/08/2009 09:23 AM, Uwe Schindler wrote:
Just an addition: The lowercase filter is only for the case of embedded
non-Arabic words. And these will not appear in the stop words.
 
I learned something new!

Hmm. If one has a mixed Arabic / English text, shouldn't one be able to augment the stopwords list with English stop words? And if so, shouldn't the stop filter come after the lower case filter?

-- DM


-----Original Message-----
From: Basem Narmok [mailto:narmok@gmail.com]
Sent: Thursday, October 08, 2009 4:20 PM
To: java-dev@lucene.apache.org
Subject: Re: Arabic Analyzer: possible bug

DM, there are no upper/lower cases in Arabic, so don't worry, but the
stop word list needs some corrections and may be missing some common
Arabic stop words.

Best,

On Thu, Oct 8, 2009 at 4:14 PM, DM Smith<dmsmith555@gmail.com>  wrote:
   
Robert,
Thanks for the info.
As I said, I am illiterate in Arabic. So I have another, perhaps
nonsensical, question:
Does the stop word list have every combination of upper/lower case for each
Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
after LowerCaseFilter?
-- DM
On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:

DM, this isn't a bug.

The Arabic stopwords are not normalized.

but for persian, i normalized the stopwords. mostly because i did not want
to have to create variations with farsi yah versus arabic yah for each one.
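
A minimal sketch of why a normalized stop list has to be applied after normalization. It is not the Lucene implementation: the normalize helper below folds only Farsi yeh (U+06CC) to Arabic yeh (U+064A), and the class name and the single stop-list entry are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Stand-alone sketch: a stop list stored only in normalized form.
final class NormalizedStopSketch {

    static final char FARSI_YEH = '\u06CC';
    static final char ARABIC_YEH = '\u064A';

    // Hypothetical stop list with one entry, spelled with Arabic yeh (the normalized form).
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("\u0627\u064A\u0646"));

    // Simplified stand-in for a normalization filter: folds Farsi yeh to Arabic yeh only.
    static String normalize(String token) {
        return token.replace(FARSI_YEH, ARABIC_YEH);
    }

    // Stop check applied to the raw token, i.e. a stop filter placed before normalization.
    static boolean isStopBeforeNormalizing(String token) {
        return STOP_WORDS.contains(token);
    }

    // Stop check applied to the normalized token, i.e. a stop filter placed after normalization.
    static boolean isStopAfterNormalizing(String token) {
        return STOP_WORDS.contains(normalize(token));
    }

    public static void main(String[] args) {
        String token = "\u0627\u06CC\u0646"; // same word, typed with Farsi yeh
        System.out.println(isStopBeforeNormalizing(token)); // raw spelling misses the list
        System.out.println(isStopAfterNormalizing(token));  // normalized spelling matches
    }
}
```

Keeping one normalized spelling per stop word avoids listing every yeh variant, at the cost of requiring the stop step to run after normalization.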
   
On Thu, Oct 8, 2009 at 7:24 AM, DM Smith<dmsmith555@gmail.com>  wrote:
     
I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
Arabic or Farsi, but have some texts to index in those languages.)
The tokenizer/filter chain for ArabicAnalyzer is:
        TokenStream result = new ArabicLetterTokenizer( reader );
        result = new StopFilter( result, stoptable );
        result = new LowerCaseFilter(result);
        result = new ArabicNormalizationFilter( result );
        result = new ArabicStemFilter( result );

        return result;

Shouldn't the StopFilter come after ArabicNormalizationFilter?

As a comparison the PersianAnalyzer has:
    TokenStream result = new ArabicLetterTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new ArabicNormalizationFilter(result);
    /* additional persian-specific normalization */
    result = new PersianNormalizationFilter(result);
    /*
     * the order here is important: the stopword list is normalized with the
     * above!
     */
    result = new StopFilter(result, stoptable);

    return result;


Thanks,
DM
       

--
Robert Muir
rcmuir@gmail.com