lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 12:37:20 GMT
DM, this isn't a bug.

The Arabic stopwords are not normalized.

But for Persian, I normalized the stopwords, mostly because I did not want
to create variations with Farsi yeh versus Arabic yeh for each one.
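A minimal sketch of why the order matters, given an unnormalized stoplist. This is not Lucene code: the `normalize` method (folding alef variants to bare alef, one of the foldings ArabicNormalizationFilter performs) and the sample word are illustrative only.

```java
import java.util.Set;

public class StopOrderDemo {
    // Hypothetical normalizer: folds alef variants (أ إ آ) to bare alef (ا),
    // mimicking one of the foldings ArabicNormalizationFilter applies.
    static String normalize(String s) {
        return s.replace('\u0625', '\u0627')   // إ -> ا
                .replace('\u0623', '\u0627')   // أ -> ا
                .replace('\u0622', '\u0627');  // آ -> ا
    }

    public static void main(String[] args) {
        // The stoplist entry keeps its original, unnormalized spelling,
        // as the Arabic stopword file does.
        Set<String> stopwords = Set.of("\u0625\u0644\u0649"); // "إلى" (to)

        String token = "\u0625\u0644\u0649"; // raw token as tokenized

        // ArabicAnalyzer's order: StopFilter BEFORE normalization -> matches.
        boolean removedBeforeNorm = stopwords.contains(token);

        // Proposed order: StopFilter AFTER normalization -> misses, because
        // the normalized token no longer equals the unnormalized list entry.
        boolean removedAfterNorm = stopwords.contains(normalize(token));

        System.out.println(removedBeforeNorm + " " + removedAfterNorm);
        // prints: true false
    }
}
```

So with an unnormalized stoplist, stopping before normalization is the order that works; PersianAnalyzer can stop after normalization only because its stoplist was normalized up front.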

On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith555@gmail.com> wrote:

> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
> Arabic or Farsi, but I have some texts to index in those languages.)
> The tokenizer/filter chain for ArabicAnalyzer is:
>         TokenStream result = new ArabicLetterTokenizer( reader );
>         result = new StopFilter( result, stoptable );
>         result = new LowerCaseFilter(result);
>         result = new ArabicNormalizationFilter( result );
>         result = new ArabicStemFilter( result );
>
>         return result;
>
> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>
>
> As a comparison the PersianAnalyzer has:
>     TokenStream result = new ArabicLetterTokenizer(reader);
>     result = new LowerCaseFilter(result);
>     result = new ArabicNormalizationFilter(result);
>     /* additional persian-specific normalization */
>     result = new PersianNormalizationFilter(result);
>     /*
>      * the order here is important: the stopword list is normalized with
> the
>      * above!
>      */
>     result = new StopFilter(result, stoptable);
>
>     return result;
>
>
> Thanks,
>  DM
>



-- 
Robert Muir
rcmuir@gmail.com
