lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 15:38:51 GMT
>
> I'm suggesting that if I know my input document well, and know that it mixes
> Arabic with one other known language, I might want to augment the stop list
> with stop words appropriate for that language. In this case, I think the stop
> filter should come after the lower case filter.
>
I think this is a good idea.
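
A minimal sketch of why the order matters, in plain Java rather than the Lucene API. The class and method names here are hypothetical stand-ins for a stop filter and a lowercase filter, and the stop set and tokens are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopOrderSketch {
    // Hypothetical stand-in for a stop filter: drops tokens found in the stop set.
    static List<String> removeStops(List<String> tokens, Set<String> stops) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stops.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Hypothetical stand-in for a lowercase filter.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed: the augmented stop list holds lower case English words only.
        Set<String> stops = new HashSet<>(Arrays.asList("the", "of"));
        // Mixed input: three English tokens and one Arabic token (as escapes).
        List<String> tokens =
            Arrays.asList("The", "History", "of", "\u0643\u062A\u0627\u0628");

        // Stop filter first: "The" does not match "the", so it survives.
        System.out.println(lowerCase(removeStops(tokens, stops)));
        // Lowercase first: "The" becomes "the" and is removed.
        System.out.println(removeStops(lowerCase(tokens), stops));
    }
}
```

With the stop filter first, the capitalized "The" slips through and is only lowercased afterwards; with the lowercase filter first, it is removed. The Arabic token is unaffected either way.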

>
> As to lower casing across the board, I also think it is pretty safe. But I
> think there are some edge cases. For example, lower casing a Greek word in
> all upper case that ends in sigma will not produce the same result as lower
> casing the same word written in lower case. The lower case word should end in
> a final sigma rather than a small sigma. For Greek, using an UpperCaseFilter
> followed by a LowerCaseFilter would handle this case.
>
Or you could use Unicode case folding. Lowercasing is for display purposes,
not search.
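
A small sketch of the sigma case in plain Java, assuming a filter that lowercases one character at a time with no look at context. The class and method names are hypothetical, and `foldPerChar` is only a rough upper-then-lower approximation; real Unicode case folding covers more cases than this:

```java
public class SigmaSketch {
    // Lowercase each char on its own, ignoring its position in the word.
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Rough approximation of folding: upper-case, then lower-case, each char.
    static String foldPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(Character.toLowerCase(Character.toUpperCase(c)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Capital omicron, delta, omicron, sigma.
        String upper = "\u039F\u0394\u039F\u03A3";
        // The same word in lower case, ending in final sigma (U+03C2).
        String lower = "\u03BF\u03B4\u03BF\u03C2";

        // Per-char lowercasing: the two spellings end in different sigmas.
        System.out.println(lowerPerChar(upper).equals(lowerPerChar(lower))); // false
        // Upper-then-lower: both spellings end in small sigma (U+03C3).
        System.out.println(foldPerChar(upper).equals(foldPerChar(lower)));   // true
    }
}
```

The point for search is that both spellings should map to the same indexed term, whichever sigma that term ends in.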

>
> IMHO, this is not an issue for the Arabic or Persian analyzers.
>
> -- DM
>
>
> On 10/08/2009 09:36 AM, Robert Muir wrote:
>
> DM, i suppose. but this is a tricky subject, what if you have mixed Arabic
> / German or something like that?
>
> for some other languages written in the Latin script, English stopwords
> could be bad :)
>
> I think that lowercasing non-Arabic text (also Cyrillic, etc.) is pretty safe
> across the board though.
>
> On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <dmsmith555@gmail.com> wrote:
>
>> On 10/08/2009 09:23 AM, Uwe Schindler wrote:
>>
>>> Just an addition: The lowercase filter is only for the case of embedded
>>> non-arabic words. And these will not appear in the stop words.
>>>
>>>
>>  I learned something new!
>>
>> Hmm. If one has a mixed Arabic / English text, shouldn't one be able to
>> augment the stopwords list with English stop words? And if so, shouldn't the
>> stop filter come after the lower case filter?
>>
>> -- DM
>>
>>  -----Original Message-----
>>>> From: Basem Narmok [mailto:narmok@gmail.com]
>>>> Sent: Thursday, October 08, 2009 4:20 PM
>>>> To: java-dev@lucene.apache.org
>>>> Subject: Re: Arabic Analyzer: possible bug
>>>>
>>>> DM, there are no upper/lower cases in Arabic, so don't worry, but the
>>>> stop word list needs some corrections and may miss some common/stop
>>>> Arabic words.
>>>>
>>>> Best,
>>>>
>>>> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith<dmsmith555@gmail.com>  wrote:
>>>>
>>>>
>>>>> Robert,
>>>>> Thanks for the info.
>>>>> As I said, I am illiterate in Arabic. So I have another, perhaps
>>>>> nonsensical, question:
>>>>> Does the stop word list have every combination of upper/lower case for
>>>>> each Arabic word in the list? (i.e. is it fully de-normalized?) Or should
>>>>> it come after LowerCaseFilter?
>>>>> -- DM
>>>>> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>>>>>
>>>>> DM, this isn't a bug.
>>>>>
>>>>> The arabic stopwords are not normalized.
>>>>>
>>>>> but for persian, i normalized the stopwords. mostly because i did not want
>>>>> to have to create variations with farsi yah versus arabic yah for each one.
>>>>
>>>>
>>>>> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith555@gmail.com> wrote:
>>>>>
>>>>>
>>>>>> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>>>>>> Arabic or Farsi, but have some texts to index in those languages.)
>>>>>> The tokenizer/filter chain for ArabicAnalyzer is:
>>>>>>         TokenStream result = new ArabicLetterTokenizer( reader );
>>>>>>         result = new StopFilter( result, stoptable );
>>>>>>         result = new LowerCaseFilter(result);
>>>>>>         result = new ArabicNormalizationFilter( result );
>>>>>>         result = new ArabicStemFilter( result );
>>>>>>
>>>>>>         return result;
>>>>>>
>>>>>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>>>>>
>>>>>> As a comparison the PersianAnalyzer has:
>>>>>>     TokenStream result = new ArabicLetterTokenizer(reader);
>>>>>>     result = new LowerCaseFilter(result);
>>>>>>     result = new ArabicNormalizationFilter(result);
>>>>>>     /* additional persian-specific normalization */
>>>>>>     result = new PersianNormalizationFilter(result);
>>>>>>     /*
>>>>>>      * the order here is important: the stopword list is normalized with
>>>>>>      * the above!
>>>>>>      */
>>>>>>     result = new StopFilter(result, stoptable);
>>>>>>
>>>>>>     return result;
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> DM
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Robert Muir
>>>>> rcmuir@gmail.com
>>>>>
>>>>
>


-- 
Robert Muir
rcmuir@gmail.com
