lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 13:29:51 GMT
On 10/08/2009 09:23 AM, Uwe Schindler wrote:
> Just an addition: The lowercase filter is only for the case of embedded
> non-arabic words. And these will not appear in the stop words.
>    
I learned something new!

Hmm. If one has a mixed Arabic / English text, shouldn't one be able to 
augment the stopwords list with English stop words? And if so, shouldn't 
the stop filter come after the lower case filter?
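
To make the ordering question concrete, here is a minimal plain-Java sketch. It
does not use the Lucene API: lowerCase and removeStops are hypothetical
stand-ins for LowerCaseFilter and a case-sensitive StopFilter, and the tokens
and the mixed stop list (English "the" plus the Arabic word "min") are made up
for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FilterOrderSketch {

    // Hypothetical stand-in for LowerCaseFilter: lowercases every token.
    // Arabic letters have no case, so Arabic tokens pass through unchanged.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Hypothetical stand-in for a case-sensitive StopFilter:
    // drops only tokens that match a stop word exactly.
    static List<String> removeStops(List<String> tokens, Set<String> stops) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (!stops.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Order used by ArabicAnalyzer in the quoted chain: stop, then lowercase.
    static List<String> stopThenLower(List<String> tokens, Set<String> stops) {
        return lowerCase(removeStops(tokens, stops));
    }

    // Proposed order: lowercase, then stop.
    static List<String> lowerThenStop(List<String> tokens, Set<String> stops) {
        return removeStops(lowerCase(tokens), stops);
    }

    public static void main(String[] args) {
        // Assumed stop list: English "the" plus Arabic "min" (written as escapes).
        Set<String> stops = new HashSet<String>(Arrays.asList("the", "\u0645\u0646"));
        // Assumed input: "The", Arabic "kitab", Arabic "min", "Lucene".
        List<String> tokens = Arrays.asList(
                "The", "\u0643\u062A\u0627\u0628", "\u0645\u0646", "Lucene");
        System.out.println(stopThenLower(tokens, stops));
        System.out.println(lowerThenStop(tokens, stops));
    }
}
```

With this input, both orders remove the Arabic stop word, but stop-then-lowercase
lets "The" through (it is emitted as "the"), while lowercase-then-stop removes it.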

-- DM

>> -----Original Message-----
>> From: Basem Narmok [mailto:narmok@gmail.com]
>> Sent: Thursday, October 08, 2009 4:20 PM
>> To: java-dev@lucene.apache.org
>> Subject: Re: Arabic Analyzer: possible bug
>>
>> DM, there are no upper/lower cases in Arabic, so don't worry, but the
>> stop word list needs some corrections and may miss some common/stop
>> Arabic words.
>>
>> Best,
>>
>> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith<dmsmith555@gmail.com>  wrote:
>>      
>>> Robert,
>>> Thanks for the info.
>>> As I said, I am illiterate in Arabic. So I have another, perhaps
>>> nonsensical, question:
>>> Does the stop word list have every combination of upper/lower case for each
>>> Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
>>> after LowerCaseFilter?
>>> -- DM
>>> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>>>
>>> DM, this isn't a bug.
>>>
>>> The arabic stopwords are not normalized.
>>>
>>> but for persian, i normalized the stopwords. mostly because i did not want
>>> to have to create variations with farsi yah versus arabic yah for each one.
>>> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith<dmsmith555@gmail.com>  wrote:
>>>> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>>>> Arabic or Farsi, but have some texts to index in those languages.)
>>>> The tokenizer/filter chain for ArabicAnalyzer is:
>>>>          TokenStream result = new ArabicLetterTokenizer( reader );
>>>>          result = new StopFilter( result, stoptable );
>>>>          result = new LowerCaseFilter(result);
>>>>          result = new ArabicNormalizationFilter( result );
>>>>          result = new ArabicStemFilter( result );
>>>>
>>>>          return result;
>>>>
>>>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>>>
>>>> As a comparison the PersianAnalyzer has:
>>>>      TokenStream result = new ArabicLetterTokenizer(reader);
>>>>      result = new LowerCaseFilter(result);
>>>>      result = new ArabicNormalizationFilter(result);
>>>>      /* additional persian-specific normalization */
>>>>      result = new PersianNormalizationFilter(result);
>>>>      /*
>>>>       * the order here is important: the stopword list is normalized with the
>>>>       * above!
>>>>       */
>>>>      result = new StopFilter(result, stoptable);
>>>>
>>>>      return result;
>>>>
>>>>
>>>> Thanks,
>>>> DM
>>>>          
>>>
>>> --
>>> Robert Muir
>>> rcmuir@gmail.com
>>>
>>>
>>>        
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>      