lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 15:46:45 GMT
DM, by the way, if you want this lowercasing behavior with the edge cases
handled, check out LUCENE-1488. There is a case folding filter there, as well
as a normalization filter, and they interact correctly for what you want :)

it's my understanding that contrib/analyzers should not have any external
dependencies, and it could be eons before the JDK exposes these things, so I
don't know what to do. It would be nice if things like ArabicAnalyzer
handled Greek edge cases correctly, don't you think?
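
A minimal sketch of the Greek sigma case raised in the quoted thread, using only JDK calls (Character.toLowerCase and Character.toUpperCase) rather than any Lucene class. The class and method names are made up for illustration, and the upper-then-lower step is only a rough stand-in for simple case folding, not the filter from LUCENE-1488. It assumes input in the Basic Multilingual Plane, so it works char by char.

```java
final class SigmaFoldingSketch {
    // Lowercase each char on its own, with no knowledge of word position.
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Uppercase, then lowercase each char: final sigma (U+03C2) becomes
    // capital sigma (U+03A3) and then small sigma (U+03C3).
    static String foldPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(Character.toUpperCase(s.charAt(i))));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = "\u039F\u0394\u039F\u03A3"; // all upper case, ends in capital sigma
        String lower = "\u03BF\u03B4\u03BF\u03C2"; // all lower case, ends in final sigma

        // Plain lowercasing leaves the two spellings different.
        System.out.println(lowerPerChar(upper).equals(lowerPerChar(lower))); // false
        // Upper-then-lower maps both to the same form.
        System.out.println(foldPerChar(upper).equals(foldPerChar(lower)));   // true
    }
}
```

With plain per-character lowercasing the upper-case spelling ends in small sigma while the lower-case spelling keeps its final sigma, so the two would index as different terms.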

On Thu, Oct 8, 2009 at 11:38 AM, Robert Muir <rcmuir@gmail.com> wrote:

>> I'm suggesting that if I know my input document well and know that it has
>> mixed text and that the text is Arabic and one other known language that I
>> might want to augment the stop list with stop words appropriate for that
>> known language. I think that in this case, stop filter should be after lower
>> case filter.
>>
>  I think this is a good idea?
>
>>
>> As to lower casing across the board, I also think it is pretty safe. But I
>> think there are some edge cases. For example, lowercasing a Greek word in
>> all upper case ending in sigma will not produce the same as lower casing the
>> same Greek word in all lower case. The Greek word should have a final sigma
>> rather than a small sigma. For Greek, using an UpperCaseFilter followed by a
>> LowerCaseFilter would handle this case.
>>
> or you could use unicode case folding. lowercasing is for display purposes,
> not search.
>
>>
>> IMHO, this is not an issue for the Arabic or Persian analyzers.
>>
>> -- DM
>>
>>
>> On 10/08/2009 09:36 AM, Robert Muir wrote:
>>
>> DM, i suppose. but this is a tricky subject, what if you have mixed Arabic
>> / German or something like that?
>>
>> for some other languages written in the Latin script, English stopwords
>> could be bad :)
>>
>> I think that lowercasing non-Arabic text (also Cyrillic, etc.) is pretty
>> safe across the board though.
>>
>> On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <dmsmith555@gmail.com> wrote:
>>
>>> On 10/08/2009 09:23 AM, Uwe Schindler wrote:
>>>
>>>> Just an addition: The lowercase filter is only for the case of embedded
>>>> non-arabic words. And these will not appear in the stop words.
>>>>
>>>>
>>>  I learned something new!
>>>
>>> Hmm. If one has a mixed Arabic / English text, shouldn't one be able to
>>> augment the stopwords list with English stop words? And if so, shouldn't the
>>> stop filter come after the lower case filter?
>>>
>>> -- DM
>>>
>>>  -----Original Message-----
>>>>> From: Basem Narmok [mailto:narmok@gmail.com]
>>>>> Sent: Thursday, October 08, 2009 4:20 PM
>>>>> To: java-dev@lucene.apache.org
>>>>> Subject: Re: Arabic Analyzer: possible bug
>>>>>
>>>>> DM, there are no upper/lower cases in Arabic, so don't worry, but the
>>>>> stop word list needs some corrections and may miss some common/stop
>>>>> Arabic words.
>>>>>
>>>>> Best,
>>>>>
>>>>> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith555@gmail.com> wrote:
>>>>>
>>>>>
>>>>>> Robert,
>>>>>> Thanks for the info.
>>>>>> As I said, I am illiterate in Arabic. So I have another, perhaps
>>>>>> nonsensical, question:
>>>>>> Does the stop word list have every combination of upper/lower case for
>>>>>> each Arabic word in the list? (i.e. is it fully de-normalized?) Or should
>>>>>> it come after LowerCaseFilter?
>>>>>> -- DM
>>>>>> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>>>>>>
>>>>>> DM, this isn't a bug.
>>>>>>
>>>>>> The arabic stopwords are not normalized.
>>>>>>
>>>>>> but for persian, i normalized the stopwords. mostly because i did not
>>>>>> want to have to create variations with farsi yah versus arabic yah for
>>>>>> each one.
>>>>>>
>>>>>
>>>>>> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith555@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>>>>>>> Arabic or Farsi, but have some texts to index in those languages.)
>>>>>>> The tokenizer/filter chain for ArabicAnalyzer is:
>>>>>>>         TokenStream result = new ArabicLetterTokenizer( reader );
>>>>>>>         result = new StopFilter( result, stoptable );
>>>>>>>         result = new LowerCaseFilter(result);
>>>>>>>         result = new ArabicNormalizationFilter( result );
>>>>>>>         result = new ArabicStemFilter( result );
>>>>>>>
>>>>>>>         return result;
>>>>>>>
>>>>>>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>>>>>>
>>>>>>> As a comparison the PersianAnalyzer has:
>>>>>>>     TokenStream result = new ArabicLetterTokenizer(reader);
>>>>>>>     result = new LowerCaseFilter(result);
>>>>>>>     result = new ArabicNormalizationFilter(result);
>>>>>>>     /* additional persian-specific normalization */
>>>>>>>     result = new PersianNormalizationFilter(result);
>>>>>>>     /*
>>>>>>>      * the order here is important: the stopword list is normalized with
>>>>>>>      * the above!
>>>>>>>      */
>>>>>>>     result = new StopFilter(result, stoptable);
>>>>>>>
>>>>>>>     return result;
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> DM
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Robert Muir
>>>>>> rcmuir@gmail.com
>>>>>>
>>>>>
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

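The ordering question in the quoted thread (should the stop filter run before or after lowercasing when the stop list is augmented with words from a cased script?) can be sketched in plain Java. This is only an illustration under stated assumptions: the class, the method names, and the tiny two-word English stop list are hypothetical, and no Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class FilterOrderSketch {
    // Hypothetical stop list, stored in lower case only.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "and"));

    // Stop filtering first, lowercasing second: "The" slips through.
    static List<String> stopThenLower(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP.contains(t)) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Lowercasing first, stop filtering second: "The" is removed.
    static List<String> lowerThenStop(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String lowered = t.toLowerCase(Locale.ROOT);
            if (!STOP.contains(lowered)) {
                out.add(lowered);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Mixed input: two English stop words, one Arabic word, one name.
        List<String> tokens =
            Arrays.asList("The", "\u0643\u062A\u0627\u0628", "and", "Lucene");
        System.out.println(stopThenLower(tokens).size()); // 3
        System.out.println(lowerThenStop(tokens).size()); // 2
    }
}
```

The Arabic token passes through both orderings unchanged, since it has no case; only the capitalized English stop word is affected by the order.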


-- 
Robert Muir
rcmuir@gmail.com
