lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 11:24:45 GMT
I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't
know Arabic or Farsi, but I have some texts to index in those languages.)

The tokenizer/filter chain for ArabicAnalyzer is:
         TokenStream result = new ArabicLetterTokenizer( reader );
         result = new StopFilter( result, stoptable );
         result = new LowerCaseFilter(result);
         result = new ArabicNormalizationFilter( result );
         result = new ArabicStemFilter( result );

         return result;

Shouldn't the StopFilter come after ArabicNormalizationFilter?


As a comparison, the PersianAnalyzer has:
     TokenStream result = new ArabicLetterTokenizer(reader);
     result = new LowerCaseFilter(result);
     result = new ArabicNormalizationFilter(result);
     /* additional persian-specific normalization */
     result = new PersianNormalizationFilter(result);
     /*
      * the order here is important: the stopword list is normalized
      * with the above!
      */
     result = new StopFilter(result, stoptable);

     return result;
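To make the ordering concern concrete, here is a self-contained toy sketch (not Lucene code) of why stop-filtering before normalization can miss stopwords. It assumes, as the PersianAnalyzer comment suggests, that the stopword table holds normalized forms; the `normalize` method is a hypothetical stand-in for one ArabicNormalizationFilter rule (folding alef-with-hamza variants to bare alef).

```java
import java.util.*;

public class StopOrderDemo {
    // Hypothetical normalizer: fold hamza-above/below alef to bare alef,
    // mirroring one kind of mapping a normalization filter performs.
    static String normalize(String token) {
        return token.replace('\u0623', '\u0627')  // alef with hamza above -> alef
                    .replace('\u0625', '\u0627'); // alef with hamza below -> alef
    }

    // Run tokens through a stop filter and the normalizer in either order.
    static List<String> analyze(List<String> tokens, Set<String> stopwords,
                                boolean stopBeforeNormalize) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (stopBeforeNormalize) {
                // ArabicAnalyzer's order: stop filter sees the raw token.
                if (stopwords.contains(t)) continue;
                out.add(normalize(t));
            } else {
                // PersianAnalyzer's order: stop filter sees the normalized token.
                String n = normalize(t);
                if (stopwords.contains(n)) continue;
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stopword stored in normalized form (bare alef).
        Set<String> stopwords = Set.of("\u0627\u0644\u0627");
        // Input text uses the hamza-above variant of the same word,
        // plus one ordinary content word.
        List<String> tokens =
            List.of("\u0623\u0644\u0627", "\u0643\u062A\u0627\u0628");

        // Stop filter first: the variant slips past the stopword table.
        System.out.println(analyze(tokens, stopwords, true).size());  // 2
        // Normalize first: the stopword is caught and removed.
        System.out.println(analyze(tokens, stopwords, false).size()); // 1
    }
}
```

Under that assumption, running StopFilter first leaves surface variants of normalized stopwords in the index, which is exactly what reordering (or normalizing the stopword table itself) would avoid.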


Thanks,
	DM