lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 13:36:33 GMT
DM, I suppose so. But this is a tricky subject: what if you have mixed Arabic /
German text, or something like that?

For some other languages written in the Latin script, English stopwords
could be bad :)

I think that lowercasing non-Arabic text (Cyrillic too, etc.) is pretty safe
across the board, though.
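
To make the ordering question concrete, here is a minimal plain-Java sketch. It does not use any Lucene classes; the stop set, the token list, and the method names are all made up for illustration. It only shows why a case-sensitive stop set with lowercase Latin entries misses "The" when stop filtering runs before lowercasing, while the Arabic entry is unaffected either way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopOrderSketch {
    // Hypothetical mixed stop set: lowercase English entries plus one Arabic
    // word ("\u0641\u064A", the preposition "fi"). Not Lucene's real list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "\u0641\u064A"));

    // Case-sensitive stop word removal.
    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Lowercasing leaves Arabic tokens unchanged, since Arabic has no case.
    public static List<String> lowerCase(List<String> tokens) {
        List<String> lowered = new ArrayList<>();
        for (String token : tokens) {
            lowered.add(token.toLowerCase(Locale.ROOT));
        }
        return lowered;
    }

    // Order discussed for ArabicAnalyzer: stop filter first, then lowercase.
    public static List<String> stopThenLower(List<String> tokens) {
        return lowerCase(removeStopWords(tokens));
    }

    // Order DM suggests: lowercase first, then stop filter.
    public static List<String> lowerThenStop(List<String> tokens) {
        return removeStopWords(lowerCase(tokens));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Lucene", "\u0641\u064A", "Index");
        System.out.println(stopThenLower(tokens)); // [the, lucene, index]
        System.out.println(lowerThenStop(tokens)); // [lucene, index]
    }
}
```

With stop filtering first, the capitalized "The" slips through and is only lowercased afterwards. The caveat above still applies: an English stop set like this could wrongly drop real words in German or other Latin-script text.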

On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <dmsmith555@gmail.com> wrote:

> On 10/08/2009 09:23 AM, Uwe Schindler wrote:
>
>> Just an addition: The lowercase filter is only for the case of embedded
>> non-Arabic words. And these will not appear in the stop words.
>>
>>
> I learned something new!
>
> Hmm. If one has a mixed Arabic / English text, shouldn't one be able to
> augment the stopwords list with English stop words? And if so, shouldn't the
> stop filter come after the lower case filter?
>
> -- DM
>
>
>  -----Original Message-----
>>> From: Basem Narmok [mailto:narmok@gmail.com]
>>> Sent: Thursday, October 08, 2009 4:20 PM
>>> To: java-dev@lucene.apache.org
>>> Subject: Re: Arabic Analyzer: possible bug
>>>
>>> DM, there are no upper/lower cases in Arabic, so don't worry, but the
>>> stop word list needs some corrections and may miss some common/stop
>>> Arabic words.
>>>
>>> Best,
>>>
>>> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith<dmsmith555@gmail.com>  wrote:
>>>
>>>
>>>> Robert,
>>>> Thanks for the info.
>>>> As I said, I am illiterate in Arabic. So I have another, perhaps
>>>> nonsensical, question:
>>>> Does the stop word list have every combination of upper/lower case for each
>>>> Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
>>>> after LowerCaseFilter?
>>>> -- DM
>>>> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>>>>
>>>> DM, this isn't a bug.
>>>>
>>>> The arabic stopwords are not normalized.
>>>>
>>>> but for persian, i normalized the stopwords. mostly because i did not want
>>>> to have to create variations with farsi yah versus arabic yah for each one.
>>>>
>>>> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith<dmsmith555@gmail.com>  wrote:
>>>>
>>>>
>>>>> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>>>>> Arabic or Farsi, but have some texts to index in those languages.)
>>>>> The tokenizer/filter chain for ArabicAnalyzer is:
>>>>>         TokenStream result = new ArabicLetterTokenizer( reader );
>>>>>         result = new StopFilter( result, stoptable );
>>>>>         result = new LowerCaseFilter(result);
>>>>>         result = new ArabicNormalizationFilter( result );
>>>>>         result = new ArabicStemFilter( result );
>>>>>
>>>>>         return result;
>>>>>
>>>>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>>>>
>>>>> As a comparison the PersianAnalyzer has:
>>>>>     TokenStream result = new ArabicLetterTokenizer(reader);
>>>>>     result = new LowerCaseFilter(result);
>>>>>     result = new ArabicNormalizationFilter(result);
>>>>>     /* additional persian-specific normalization */
>>>>>     result = new PersianNormalizationFilter(result);
>>>>>     /*
>>>>>      * the order here is important: the stopword list is normalized with the
>>>>>      * above!
>>>>>      */
>>>>>     result = new StopFilter(result, stoptable);
>>>>>
>>>>>     return result;
>>>>>
>>>>>
>>>>> Thanks,
>>>>> DM
>>>>>
>>>>>
>>>>
>>>> --
>>>> Robert Muir
>>>> rcmuir@gmail.com
>>>>
>>>>
>>>>
>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>>


-- 
Robert Muir
rcmuir@gmail.com
