lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 13:25:47 GMT
the LowerCaseFilter is there in case you happen to have some English text
mixed in :)

but to answer your question, the stopword list contains some variant forms,
and I added a couple more in LUCENE-1758.

Maybe this will help:
ArabicNormalizer is 'aggressive' for the Arabic language.
ArabicNormalizer + PersianNormalizer is 'not very aggressive' for the
Persian language.

So for Arabic, I thought it unsafe to normalize the stopwords.

For Persian, the normalizer is really important so that the stopword list
works regardless of encoding (a variant form of yah and kaf is sometimes
used, depending on the computer system or legacy encoding).
Also, most words in the Persian stopword list aren't even real words on
their own.

the languages are very different so the analyzers work in different ways...
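
To illustrate why the filter order matters once a stopword list is stored in
normalized form, here is a minimal sketch in plain Java. It does not use
Lucene: the class StopwordOrderSketch, its methods, the two-letter mapping
and the one-word stop list are all assumptions made for illustration, not
the actual PersianNormalizer or the shipped stopword file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of "normalize, then stop" versus "stop, then normalize".
final class StopwordOrderSketch {

    // Assumed toy normalizer: folds two Persian letter variants onto Arabic letters.
    static String normalize(String token) {
        return token
                .replace('\u06CC', '\u064A')   // Farsi yeh (U+06CC) -> Arabic yeh (U+064A)
                .replace('\u06A9', '\u0643');  // keheh (U+06A9)     -> Arabic kaf (U+0643)
    }

    // PersianAnalyzer-style order: normalize each token, then test the stop set.
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String norm = normalize(token);
            if (!stopwords.contains(norm)) {
                kept.add(norm);
            }
        }
        return kept;
    }

    // Reversed order: test the raw token against the stop set, then normalize.
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopwords.contains(token)) {
                kept.add(normalize(token));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical stop list holding one word, stored in normalized form (kaf + heh).
        Set<String> stopwords = new HashSet<>(Arrays.asList("\u0643\u0647"));

        // Input as it might arrive from a Persian source: the same word written
        // with keheh, followed by a content word that also starts with keheh.
        List<String> tokens = Arrays.asList("\u06A9\u0647", "\u06A9\u062A\u0627\u0628");

        System.out.println(normalizeThenStop(tokens, stopwords).size()); // prints 1
        System.out.println(stopThenNormalize(tokens, stopwords).size()); // prints 2
    }
}
```

With the stop list stored normalized, only the normalize-first order removes
the keheh spelling of the stopword. With an unnormalized list, as in the
Arabic case, the stop set has to be matched against the raw tokens instead.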

On Thu, Oct 8, 2009 at 9:18 AM, Ahmed Al-Obaidy <ahmad_alobaidy@yahoo.com> wrote:

> There is no upper and lower case in Arabic.
>
> --- On *Thu, 10/8/09, DM Smith <dmsmith555@gmail.com>* wrote:
>
>
> From: DM Smith <dmsmith555@gmail.com>
> Subject: Re: Arabic Analyzer: possible bug
> To: java-dev@lucene.apache.org
> Date: Thursday, October 8, 2009, 3:14 PM
>
>
> Robert, thanks for the info.
> As I said, I am illiterate in Arabic. So I have another, perhaps
> nonsensical, question:
> Does the stop word list have every combination of upper/lower case for each
> Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
> after LowerCaseFilter?
>
> -- DM
>
> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>
> DM, this isn't a bug.
>
> The arabic stopwords are not normalized.
>
> but for persian, i normalized the stopwords. mostly because i did not want
> to have to create variations with farsi yah versus arabic yah for each one.
>
> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith555@gmail.com> wrote:
>
>> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>> Arabic or Farsi, but have some texts to index in those languages.)
>> The tokenizer/filter chain for ArabicAnalyzer is:
>>         TokenStream result = new ArabicLetterTokenizer( reader );
>>         result = new StopFilter( result, stoptable );
>>         result = new LowerCaseFilter(result);
>>         result = new ArabicNormalizationFilter( result );
>>         result = new ArabicStemFilter( result );
>>
>>         return result;
>>
>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>
>>
>> As a comparison the PersianAnalyzer has:
>>     TokenStream result = new ArabicLetterTokenizer(reader);
>>     result = new LowerCaseFilter(result);
>>     result = new ArabicNormalizationFilter(result);
>>     /* additional persian-specific normalization */
>>     result = new PersianNormalizationFilter(result);
>>     /*
>>      * the order here is important: the stopword list is normalized with
>>      * the above!
>>      */
>>     result = new StopFilter(result, stoptable);
>>
>>     return result;
>>
>>
>> Thanks,
>>  DM
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com
