lucene-dev mailing list archives

From Basem Narmok <nar...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 20:23:05 GMT
Ok, the list is ready (initial one, as I will continue enhancing it).
I will create a JIRA issue and send the patch.

Also, I have some small changes to the normalization (e.g. removing
some diacritics, and other changes)

Best,
Basem

On Thu, Oct 8, 2009 at 8:51 PM, Robert Muir <rcmuir@gmail.com> wrote:
> Basem, I really appreciate your time if you are able to do this.
>
> It's been my hope that introducing Arabic/Farsi support will create enough
> interest to encourage more qualified people to come and really make things
> nice.
>
> If you don't mind, you can look at
> http://wiki.apache.org/lucene-java/HowToContribute and create a JIRA Issue
> with a patch file to improve our stopwords list.
>
> Otherwise, in my opinion a good list is also acceptable and I will volunteer
> to turn it into a patch :)
>
> On Thu, Oct 8, 2009 at 9:32 AM, Basem Narmok <narmok@gmail.com> wrote:
>>
>> Robert,
>>
>> I will be happy to do so. Currently, I am testing the new Arabic
>> analyzer in 2.9, and also I will prepare a new stop word list. I will
>> provide you with my findings/comments soon.
>>
>> Best,
>>
>> On Thu, Oct 8, 2009 at 4:28 PM, Robert Muir <rcmuir@gmail.com> wrote:
>> > Basem, by any chance would you be willing to help improve it for us?
>> >
>> > On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok <narmok@gmail.com> wrote:
>> >>
>> >> DM, there are no upper/lower cases in Arabic, so don't worry, but the
>> >> stop word list needs some corrections and may be missing some common
>> >> Arabic stop words.
>> >>
>> >> Best,
>> >>
>> >> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith555@gmail.com> wrote:
>> >> > Robert,
>> >> > Thanks for the info.
>> >> > As I said, I am illiterate in Arabic. So I have another, perhaps
>> >> > nonsensical, question:
>> >> > Does the stop word list have every combination of upper/lower case for each
>> >> > Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
>> >> > after LowerCaseFilter?
>> >> > -- DM
>> >> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>> >> >
>> >> > DM, this isn't a bug.
>> >> >
>> >> > The arabic stopwords are not normalized.
>> >> >
>> >> > but for persian, i normalized the stopwords. mostly because i did not want
>> >> > to have to create variations with farsi yah versus arabic yah for each one.
>> >> >
>> >> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith555@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>> >> >> Arabic or Farsi, but have some texts to index in those languages.)
>> >> >> The tokenizer/filter chain for ArabicAnalyzer is:
>> >> >>         TokenStream result = new ArabicLetterTokenizer( reader );
>> >> >>         result = new StopFilter( result, stoptable );
>> >> >>         result = new LowerCaseFilter(result);
>> >> >>         result = new ArabicNormalizationFilter( result );
>> >> >>         result = new ArabicStemFilter( result );
>> >> >>
>> >> >>         return result;
>> >> >>
>> >> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>> >> >>
>> >> >> As a comparison the PersianAnalyzer has:
>> >> >>     TokenStream result = new ArabicLetterTokenizer(reader);
>> >> >>     result = new LowerCaseFilter(result);
>> >> >>     result = new ArabicNormalizationFilter(result);
>> >> >>     /* additional persian-specific normalization */
>> >> >>     result = new PersianNormalizationFilter(result);
>> >> >>     /*
>> >> >>      * the order here is important: the stopword list is normalized with the
>> >> >>      * above!
>> >> >>      */
>> >> >>     result = new StopFilter(result, stoptable);
>> >> >>
>> >> >>     return result;
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >> DM
>> >> >
>> >> >
>> >> > --
>> >> > Robert Muir
>> >> > rcmuir@gmail.com
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >
>> >
>> >
>> > --
>> > Robert Muir
>> > rcmuir@gmail.com
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
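
The point made in the thread (the stop filter has to sit at the stage whose
token form matches the form of the stopword list) can be illustrated with a
small sketch. This is not Lucene code: the class FilterOrderSketch, its
method names, and the normalize() helper are hypothetical, and normalize()
is a deliberately simplified stand-in for what a normalization filter might
do (folding hamza/madda alef variants to bare alef, alef maksura to yeh,
and dropping a range of diacritics).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Lucene.
class FilterOrderSketch {

    // Simplified stand-in for a normalization step (an assumption, not
    // the actual ArabicNormalizationFilter logic).
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (c == '\u0622' || c == '\u0623' || c == '\u0625') {
                sb.append('\u0627');   // alef with madda/hamza -> bare alef
            } else if (c == '\u0649') {
                sb.append('\u064A');   // alef maksura -> yeh
            } else if (c >= '\u064B' && c <= '\u0652') {
                // diacritic (fathatan through sukun): drop it
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Order used when the stopword list holds raw (unnormalized) forms:
    // match against the raw token, then normalize the survivors.
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopwords.contains(t)) {
                out.add(normalize(t));
            }
        }
        return out;
    }

    // Order used when the stopword list itself is stored in normalized
    // form: normalize first, then match.
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!stopwords.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String ila = "\u0625\u0644\u0649";          // a common preposition, raw spelling
        String kitab = "\u0643\u062A\u0627\u0628";  // a content word
        List<String> tokens = List.of(ila, kitab);

        Set<String> rawList = Set.of(ila);
        Set<String> normalizedList = Set.of(normalize(ila));

        System.out.println(stopThenNormalize(tokens, rawList).size());
        System.out.println(normalizeThenStop(tokens, rawList).size());
        System.out.println(normalizeThenStop(tokens, normalizedList).size());
    }
}
```

In this sketch a raw list paired with normalize-then-stop lets the stopword
slip through, because the normalized token no longer equals the raw list
entry. Either pairing works as long as the list and the filter position
agree, which is the distinction drawn above between the Arabic and Persian
analyzers.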

