lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 13:23:18 GMT
Just an addition: the LowerCaseFilter is only there for embedded
non-Arabic words, and these will not appear in the stop words.
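
A minimal sketch of this point in plain Java, with no Lucene dependency. The
class name, the two-word stop list, and the helper methods are hypothetical
stand-ins for StopFilter and LowerCaseFilter, not the real implementations.
It only illustrates that, because Arabic letters have no case, filtering stop
words before or after lowercasing yields the same tokens when the stop list
contains only Arabic words:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class LowercaseStopOrder {
    // Hypothetical two-entry stop list: "fi" (in) and "min" (from).
    // The real list ships with ArabicAnalyzer and is much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("\u0641\u064A", "\u0645\u0646"));

    // Stand-in for LowerCaseFilter: lowercases every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Stand-in for StopFilter: drops tokens found in the stop list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "kitab" (book), "fi", an embedded Latin word, "min", "al-maktaba" (the library)
        List<String> tokens = Arrays.asList(
                "\u0643\u062A\u0627\u0628", "\u0641\u064A", "Lucene",
                "\u0645\u0646", "\u0627\u0644\u0645\u0643\u062A\u0628\u0629");

        List<String> stopFirst = lowerCase(removeStopWords(tokens));
        List<String> lowerFirst = removeStopWords(lowerCase(tokens));

        // Both orders give the same result; only "Lucene" is changed by lowercasing.
        System.out.println(stopFirst.equals(lowerFirst)); // prints "true"
    }
}
```

The two orders would only differ if the stop list contained cased (for
example Latin) entries, which is not the case here.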

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Basem Narmok [mailto:narmok@gmail.com]
> Sent: Thursday, October 08, 2009 4:20 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Arabic Analyzer: possible bug
> 
> DM, there are no upper/lower cases in Arabic, so don't worry. However, the
> stop word list needs some corrections and may be missing some common
> Arabic stop words.
> 
> Best,
> 
> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith555@gmail.com> wrote:
> > Robert,
> > Thanks for the info.
> > As I said, I am illiterate in Arabic. So I have another, perhaps
> > nonsensical, question:
> > Does the stop word list have every combination of upper/lower case for each
> > Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
> > after LowerCaseFilter?
> > -- DM
> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
> >
> > DM, this isn't a bug.
> >
> > The Arabic stopwords are not normalized.
> >
> > But for Persian, I normalized the stopwords, mostly because I did not want
> > to have to create variations with Farsi yah versus Arabic yah for each one.
> >
> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith555@gmail.com> wrote:
> >>
> >> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
> >> Arabic or Farsi, but have some texts to index in those languages.)
> >> The tokenizer/filter chain for ArabicAnalyzer is:
> >>         TokenStream result = new ArabicLetterTokenizer( reader );
> >>         result = new StopFilter( result, stoptable );
> >>         result = new LowerCaseFilter(result);
> >>         result = new ArabicNormalizationFilter( result );
> >>         result = new ArabicStemFilter( result );
> >>
> >>         return result;
> >>
> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
> >>
> >> As a comparison the PersianAnalyzer has:
> >>     TokenStream result = new ArabicLetterTokenizer(reader);
> >>     result = new LowerCaseFilter(result);
> >>     result = new ArabicNormalizationFilter(result);
> >>     /* additional persian-specific normalization */
> >>     result = new PersianNormalizationFilter(result);
> >>     /*
> >>      * the order here is important: the stopword list is normalized with
> >>      * the above!
> >>      */
> >>     result = new StopFilter(result, stoptable);
> >>
> >>     return result;
> >>
> >>
> >> Thanks,
> >> DM
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org




