lucene-dev mailing list archives

From Basem Narmok <nar...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 13:20:27 GMT
DM, there are no upper or lower cases in Arabic, so don't worry. However,
the stop word list needs some corrections and may be missing some common
Arabic stop words.

Best,

On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith555@gmail.com> wrote:
> Robert,
> Thanks for the info.
> As I said, I am illiterate in Arabic. So I have another, perhaps
> nonsensical, question:
> Does the stop word list have every combination of upper/lower case for each
> Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
> after LowerCaseFilter?
> -- DM
> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>
> DM, this isn't a bug.
>
> The Arabic stopwords are not normalized.
>
> But for Persian, I normalized the stopwords, mostly because I did not want
> to have to create variations with Farsi yah versus Arabic yah for each one.
>
> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith555@gmail.com> wrote:
>>
>> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>> Arabic or Farsi, but have some texts to index in those languages.)
>> The tokenizer/filter chain for ArabicAnalyzer is:
>>         TokenStream result = new ArabicLetterTokenizer( reader );
>>         result = new StopFilter( result, stoptable );
>>         result = new LowerCaseFilter(result);
>>         result = new ArabicNormalizationFilter( result );
>>         result = new ArabicStemFilter( result );
>>
>>         return result;
>>
>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>
>> As a comparison the PersianAnalyzer has:
>>     TokenStream result = new ArabicLetterTokenizer(reader);
>>     result = new LowerCaseFilter(result);
>>     result = new ArabicNormalizationFilter(result);
>>     /* additional persian-specific normalization */
>>     result = new PersianNormalizationFilter(result);
>>     /*
>>      * the order here is important: the stopword list is normalized with
>>      * the above!
>>      */
>>     result = new StopFilter(result, stoptable);
>>
>>     return result;
>>
>>
>> Thanks,
>> DM
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
>
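
To make the ordering point in this thread concrete, here is a minimal plain-Java sketch that does not use Lucene at all. Everything in it is an assumption made for illustration: the class name StopOrderSketch, the one-entry stop set, and the toy normalize() method, which only folds a few alef forms, alef maksura, and tatweel and is not the real ArabicNormalizationFilter. It compares filtering stop words before normalization (as ArabicAnalyzer does) with filtering after it (as PersianAnalyzer does).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopOrderSketch {
    // Hypothetical stop list with one unnormalized entry:
    // U+0625 U+0644 U+0649 (the preposition "ila", written with hamza below the alef).
    static final Set<String> RAW_STOPS =
            new HashSet<>(Arrays.asList("\u0625\u0644\u0649"));

    // Toy normalizer (an assumption, not Lucene's): folds hamza-carrying alef
    // forms to bare alef, alef maksura to yeh, and drops tatweel.
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            switch (c) {
                case '\u0622': case '\u0623': case '\u0625':
                    sb.append('\u0627'); break;
                case '\u0649':
                    sb.append('\u064A'); break;
                case '\u0640':
                    break; // tatweel is removed
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    // Normalizes every entry of a stop list, as was done for the Persian list.
    static Set<String> normalizeAll(Set<String> stops) {
        Set<String> out = new HashSet<>();
        for (String s : stops) out.add(normalize(s));
        return out;
    }

    // ArabicAnalyzer-style order: stop filtering first, normalization second.
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stops) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stops.contains(t)) out.add(normalize(t));
        }
        return out;
    }

    // PersianAnalyzer-style order: normalization first, stop filtering second.
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stops) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!stops.contains(n)) out.add(n);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "\u0625\u0644\u0649",        // stop word, spelled as in the list
                "\u0627\u0644\u0649",        // same word, written without the hamza
                "\u0643\u062A\u0627\u0628"); // an ordinary word ("kitab")

        // Raw list, stop first: only the exact spelling is removed.
        System.out.println(stopThenNormalize(tokens, RAW_STOPS).size());
        // Raw list, stop last: normalized tokens never match the raw entry.
        System.out.println(normalizeThenStop(tokens, RAW_STOPS).size());
        // Normalized list, stop last: both spellings are removed.
        System.out.println(normalizeThenStop(tokens, normalizeAll(RAW_STOPS)).size());
    }
}
```

Under these assumptions, moving the stop filter after normalization while keeping an unnormalized list would make things worse, not better, which is consistent with the reply that the current order is not a bug. Catching spelling variants would need the list itself to be normalized first, as was done for Persian.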


