lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 13:14:55 GMT
Robert,
Thanks for the info.
As I said, I am illiterate in Arabic. So I have another, perhaps  
nonsensical, question:
Does the stop word list have every combination of upper/lower case for  
each Arabic word in the list? (i.e. is it fully de-normalized?) Or  
should it come after LowerCaseFilter?

-- DM

On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:

> DM, this isn't a bug.
>
> The arabic stopwords are not normalized.
>
> but for persian, i normalized the stopwords. mostly because i did  
> not want to have to create variations with farsi yah versus arabic  
> yah for each one.
>
> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith555@gmail.com> wrote:
> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't  
> know Arabic or Farsi, but have some texts to index in those  
> languages.)
>
> The tokenizer/filter chain for ArabicAnalyzer is:
>         TokenStream result = new ArabicLetterTokenizer( reader );
>         result = new StopFilter( result, stoptable );
>         result = new LowerCaseFilter(result);
>         result = new ArabicNormalizationFilter( result );
>         result = new ArabicStemFilter( result );
>
>         return result;
>
> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>
>
> As a comparison the PersianAnalyzer has:
>     TokenStream result = new ArabicLetterTokenizer(reader);
>     result = new LowerCaseFilter(result);
>     result = new ArabicNormalizationFilter(result);
>     /* additional persian-specific normalization */
>     result = new PersianNormalizationFilter(result);
>     /*
>      * the order here is important: the stopword list is normalized with the
>      * above!
>      */
>     result = new StopFilter(result, stoptable);
>
>     return result;
>
>
> Thanks,
> 	DM
>
>
>
> -- 
> Robert Muir
> rcmuir@gmail.com
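
To make the ordering question in this thread concrete, here is a minimal sketch in plain Java. It does not use Lucene: the class StopOrderSketch, its methods, and the single-rule normalize() are hypothetical stand-ins, and the one-word stop list is only an example. It assumes a stop list stored in normalized form only (Arabic yeh, U+064A) and input that spells the same word with Farsi yeh (U+06CC), which is roughly the situation Robert describes for Persian.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of filter ordering; NOT the Lucene API.
class StopOrderSketch {

    // Hypothetical stand-in for a normalization filter:
    // folds Farsi yeh (U+06CC) to Arabic yeh (U+064A), nothing else.
    static String normalize(String token) {
        return token.replace('\u06CC', '\u064A');
    }

    // Stop filtering first, normalization second (the ArabicAnalyzer order).
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (!stopwords.contains(t)) {
                out.add(normalize(t));
            }
        }
        return out;
    }

    // Normalization first, stop filtering second (the PersianAnalyzer order).
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!stopwords.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stop list holds only the normalized spelling (feh + Arabic yeh).
        Set<String> stopwords = new HashSet<String>(Arrays.asList("\u0641\u064A"));
        // Input spells the stopword with Farsi yeh, followed by an ordinary word.
        List<String> tokens = Arrays.asList("\u0641\u06CC", "\u0643\u062A\u0627\u0628");

        System.out.println(stopThenNormalize(tokens, stopwords).size()); // prints 2
        System.out.println(normalizeThenStop(tokens, stopwords).size()); // prints 1
    }
}
```

With a normalized stop list, the stop check only catches the variant spelling when it runs after normalization. With a stop list that is deliberately left unnormalized, as Robert says the Arabic one is, the check has to run against the raw tokens instead.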

