lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 13:28:01 GMT
Basem, by any chance would you be willing to help improve it for us?

On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok <narmok@gmail.com> wrote:

> DM, there are no upper/lower cases in Arabic, so don't worry. However, the
> stop word list needs some corrections and may be missing some common
> Arabic stop words.
>
> Best,
>
> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith555@gmail.com> wrote:
> > Robert,
> > Thanks for the info.
> > As I said, I am illiterate in Arabic. So I have another, perhaps
> > nonsensical, question:
> > Does the stop word list have every combination of upper/lower case for each
> > Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
> > after LowerCaseFilter?
> > -- DM
> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
> >
> > DM, this isn't a bug.
> >
> > The arabic stopwords are not normalized.
> >
> > but for persian, i normalized the stopwords. mostly because i did not want
> > to have to create variations with farsi yah versus arabic yah for each one.
> >
> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith555@gmail.com> wrote:
> >>
> >> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
> >> Arabic or Farsi, but have some texts to index in those languages.)
> >> The tokenizer/filter chain for ArabicAnalyzer is:
> >>         TokenStream result = new ArabicLetterTokenizer( reader );
> >>         result = new StopFilter( result, stoptable );
> >>         result = new LowerCaseFilter(result);
> >>         result = new ArabicNormalizationFilter( result );
> >>         result = new ArabicStemFilter( result );
> >>
> >>         return result;
> >>
> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
> >>
> >> As a comparison the PersianAnalyzer has:
> >>     TokenStream result = new ArabicLetterTokenizer(reader);
> >>     result = new LowerCaseFilter(result);
> >>     result = new ArabicNormalizationFilter(result);
> >>     /* additional persian-specific normalization */
> >>     result = new PersianNormalizationFilter(result);
> >>     /*
> >>      * the order here is important: the stopword list is normalized with
> >>      * the above!
> >>      */
> >>     result = new StopFilter(result, stoptable);
> >>
> >>     return result;
> >>
> >>
> >> Thanks,
> >> DM
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
> >
>
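
To illustrate the ordering question discussed in this thread, here is a minimal, self-contained Java sketch. It does not use Lucene: `FilterOrderDemo`, `normalize`, and the one-word stop list are hypothetical stand-ins, and the toy normalizer (folding hamza-carrying alef forms to bare alef and dropping tatweel) only approximates the kind of folding a normalization filter performs. Under those assumptions, it shows that a stop list stored in normalized form only matches when stop removal runs after normalization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not part of Lucene.
final class FilterOrderDemo {

    // Toy normalizer (an assumption, not Lucene's implementation):
    // folds U+0622, U+0623, U+0625 to bare alef U+0627 and drops tatweel U+0640.
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            switch (c) {
                case '\u0622':
                case '\u0623':
                case '\u0625':
                    sb.append('\u0627');
                    break;
                case '\u0640':
                    break; // tatweel is removed
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    // Applies the toy normalizer to every token.
    static List<String> normalizeAll(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(normalize(t));
        }
        return out;
    }

    // Drops every token that appears verbatim in the stop set.
    static List<String> removeStops(List<String> tokens, Set<String> stops) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stops.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Order 1: stop removal first, then normalization.
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stops) {
        return normalizeAll(removeStops(tokens, stops));
    }

    // Order 2: normalization first, then stop removal.
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stops) {
        return removeStops(normalizeAll(tokens), stops);
    }

    public static void main(String[] args) {
        // Stop list kept in normalized form: bare alef + noon.
        Set<String> normalizedStops = new HashSet<>(Arrays.asList("\u0627\u0646"));
        // Input: the same word written with alef-hamza-above, plus a content word.
        List<String> tokens = Arrays.asList("\u0623\u0646", "\u0643\u062A\u0627\u0628");

        // The stop word survives because its surface form is not in the list.
        System.out.println(stopThenNormalize(tokens, normalizedStops).size()); // prints 2
        // The stop word is removed because it was normalized before the lookup.
        System.out.println(normalizeThenStop(tokens, normalizedStops).size()); // prints 1
    }
}
```

Conversely, a list kept in un-normalized surface forms is matched against raw tokens, which is why running StopFilter before ArabicNormalizationFilter is consistent with an un-normalized Arabic stop list.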


-- 
Robert Muir
rcmuir@gmail.com
