lucene-dev mailing list archives

From Basem Narmok <nar...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 13:32:44 GMT
Robert,

I will be happy to do so. Currently, I am testing the new Arabic
analyzer in 2.9, and also I will prepare a new stop word list. I will
provide you with my findings/comments soon.

Best,

On Thu, Oct 8, 2009 at 4:28 PM, Robert Muir <rcmuir@gmail.com> wrote:
> Basem, by any chance would you be willing to help improve it for us?
>
> On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok <narmok@gmail.com> wrote:
>>
>> DM, there are no upper/lower cases in Arabic, so don't worry, but the
>> stop word list needs some corrections and may miss some common/stop
>> Arabic words.
>>
>> Best,
>>
>> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith555@gmail.com> wrote:
>> > Robert,
>> > Thanks for the info.
>> > As I said, I am illiterate in Arabic. So I have another, perhaps
>> > nonsensical, question:
>> > Does the stop word list have every combination of upper/lower case for
>> > each Arabic word in the list? (i.e. is it fully de-normalized?) Or should
>> > it come after LowerCaseFilter?
>> > -- DM
>> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>> >
>> > DM, this isn't a bug.
>> >
>> > The arabic stopwords are not normalized.
>> >
>> > but for persian, i normalized the stopwords. mostly because i did not
>> > want to have to create variations with farsi yah versus arabic yah for
>> > each one.
>> >
>> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith555@gmail.com> wrote:
>> >>
>> >> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>> >> Arabic or Farsi, but have some texts to index in those languages.)
>> >> The tokenizer/filter chain for ArabicAnalyzer is:
>> >>         TokenStream result = new ArabicLetterTokenizer( reader );
>> >>         result = new StopFilter( result, stoptable );
>> >>         result = new LowerCaseFilter(result);
>> >>         result = new ArabicNormalizationFilter( result );
>> >>         result = new ArabicStemFilter( result );
>> >>
>> >>         return result;
>> >>
>> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>> >>
>> >> As a comparison the PersianAnalyzer has:
>> >>     TokenStream result = new ArabicLetterTokenizer(reader);
>> >>     result = new LowerCaseFilter(result);
>> >>     result = new ArabicNormalizationFilter(result);
>> >>     /* additional persian-specific normalization */
>> >>     result = new PersianNormalizationFilter(result);
>> >>     /*
>> >>      * the order here is important: the stopword list is normalized with the
>> >>      * above!
>> >>      */
>> >>     result = new StopFilter(result, stoptable);
>> >>
>> >>     return result;
>> >>
>> >>
>> >> Thanks,
>> >> DM
>> >
>> >
>> > --
>> > Robert Muir
>> > rcmuir@gmail.com
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
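
The ordering question in this thread can be illustrated without Lucene. The sketch below is only an illustration under stated assumptions: the class StopOrderSketch, its methods, and the two-token input are hypothetical, and normalize() is a simplified stand-in that folds Farsi yeh (U+06CC) into Arabic yeh (U+064A) and drops tatweel, not the behavior of any Lucene filter. It shows why a stop list stored in normalized form only matches when stop filtering runs after normalization, which is the reason Robert gives for the PersianAnalyzer order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopOrderSketch {

    // Hypothetical stand-in for a normalization filter: maps Farsi yeh
    // (U+06CC) to Arabic yeh (U+064A) and drops tatweel (U+0640).
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (c == '\u0640') {
                continue; // skip the elongation character
            }
            sb.append(c == '\u06CC' ? '\u064A' : c);
        }
        return sb.toString();
    }

    // Stop filtering first, normalization second (the order quoted for ArabicAnalyzer).
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(normalize(t));
            }
        }
        return out;
    }

    // Normalization first, stop filtering second (the order quoted for PersianAnalyzer).
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!stopWords.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed stop list, stored in normalized form (spelled with Arabic yeh).
        Set<String> stopWords = new HashSet<>(Arrays.asList("\u0641\u064A"));
        // Illustrative input: the same word spelled with Farsi yeh, then a content word.
        List<String> tokens = Arrays.asList("\u0641\u06CC", "\u0643\u062A\u0627\u0628");

        // The un-normalized spelling slips past the stop list: 2 tokens survive.
        System.out.println(stopThenNormalize(tokens, stopWords).size());
        // After normalization the stop word matches: 1 token survives.
        System.out.println(normalizeThenStop(tokens, stopWords).size());
    }
}
```

If the stop list is instead stored un-normalized, as Robert says the Arabic one is, filtering before normalization is consistent, but the list then has to carry every spelling variant itself.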


