lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 15:28:29 GMT
Robert,
Yes it is tricky.

I'm not suggesting that the ArabicAnalyzer have any stopwords other than 
Arabic.

I'm suggesting that if I know my input documents well, and know that
they mix Arabic with one other known language, I might want to augment
the stop list with stop words appropriate for that language. In that
case, I think the stop filter should come after the lower case filter.
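
To illustrate the ordering point, here is a minimal sketch in plain Java.
It is not Lucene code: lowerCase and removeStops are hypothetical stand-ins
for LowerCaseFilter and a case-sensitive StopFilter, and the stop list and
tokens are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopOrderSketch {
    // Hypothetical augmented stop list, stored in lower case: one Arabic
    // entry ("fi", written with Unicode escapes) plus two English entries.
    static final Set<String> STOP =
        new HashSet<>(Arrays.asList("\u0641\u064A", "the", "of"));

    // Stand-in for a lower case filter; Arabic tokens pass through unchanged.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Stand-in for a stop filter that does exact, case-sensitive lookups.
    static List<String> removeStops(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Order used by the ArabicAnalyzer chain quoted in this thread.
    static List<String> stopThenLower(List<String> tokens) {
        return lowerCase(removeStops(tokens));
    }

    // Order suggested here for mixed-language text.
    static List<String> lowerThenStop(List<String> tokens) {
        return removeStops(lowerCase(tokens));
    }

    public static void main(String[] args) {
        // "The", an Arabic content word, the Arabic stop word, "Lucene".
        List<String> tokens = Arrays.asList(
            "The", "\u0643\u062A\u0627\u0628", "\u0641\u064A", "Lucene");
        System.out.println(stopThenLower(tokens)); // "the" survives
        System.out.println(lowerThenStop(tokens)); // "the" is removed
    }
}
```

With the stop step first, a capitalized "The" does not match the lower case
entry and slips through; with the lower case step first, it is removed. The
Arabic stop word is removed either way, since it has no case.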

As to lower casing across the board, I also think it is pretty safe, but
I think there are some edge cases. For example, lower casing a Greek word
that is in all upper case and ends in sigma will not produce the same
result as lower casing the same word already written in lower case: the
lower case word ends in a final sigma, while the lower-cased upper case
word ends in an ordinary small sigma. For Greek, using an
UpperCaseFilter followed by a LowerCaseFilter would handle this case.
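
A small sketch of the sigma case in plain Java, assuming case mapping is
done one character at a time with Character.toLowerCase and
Character.toUpperCase. The helper names are hypothetical and the Greek
word is just an example, written with Unicode escapes.

```java
public class SigmaSketch {
    // Lower case each char independently (no context-sensitive final sigma).
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Upper case each char independently.
    static String upperPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toUpperCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Upper case first, then lower case, so both sigma forms collapse.
    static String upperThenLower(String s) {
        return lowerPerChar(upperPerChar(s));
    }

    public static void main(String[] args) {
        String upper = "\u039F\u0394\u039F\u03A3"; // all upper case, ends in capital sigma
        String lower = "\u03BF\u03B4\u03BF\u03C2"; // all lower case, ends in final sigma

        // Lower casing alone leaves the two spellings different.
        System.out.println(lowerPerChar(upper).equals(lowerPerChar(lower)));

        // Upper casing first makes them agree.
        System.out.println(upperThenLower(upper).equals(upperThenLower(lower)));
    }
}
```

Under that assumption, lower casing alone maps the capital sigma to the
small sigma (U+03C3) and leaves the final sigma (U+03C2) alone, so the two
spellings index differently. Going through upper case first maps both to
the small sigma.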

IMHO, this is not an issue for the Arabic or Persian analyzers.

-- DM

On 10/08/2009 09:36 AM, Robert Muir wrote:
> DM, i suppose. but this is a tricky subject, what if you have mixed 
> Arabic / German or something like that?
>
> for some other languages written in the Latin script, English 
> stopwords could be bad :)
>
> I think that Lowercasing non-Arabic (also cyrillic, etc), is pretty 
> safe across the board though.
>
> On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <dmsmith555@gmail.com> wrote:
>
>     On 10/08/2009 09:23 AM, Uwe Schindler wrote:
>
>         Just an addition: The lowercase filter is only for the case of
>         embedded non-arabic words. And these will not appear in the
>         stop words.
>
>     I learned something new!
>
>     Hmm. If one has a mixed Arabic / English text, shouldn't one be
>     able to augment the stopwords list with English stop words? And if
>     so, shouldn't the stop filter come after the lower case filter?
>
>     -- DM
>
>
>             -----Original Message-----
>             From: Basem Narmok [mailto:narmok@gmail.com]
>             Sent: Thursday, October 08, 2009 4:20 PM
>             To: java-dev@lucene.apache.org
>             Subject: Re: Arabic Analyzer: possible bug
>
>             DM, there is no upper/lower cases in Arabic, so don't
>             worry, but the stop word list needs some corrections and
>             may miss some common/stop Arabic words.
>
>             Best,
>
>             On Thu, Oct 8, 2009 at 4:14 PM, DM Smith
>             <dmsmith555@gmail.com> wrote:
>
>                 Robert,
>                 Thanks for the info.
>                 As I said, I am illiterate in Arabic. So I have
>                 another, perhaps nonsensical, question:
>                 Does the stop word list have every combination of
>                 upper/lower case for each Arabic word in the list?
>                 (i.e. is it fully de-normalized?) Or should it come
>                 after LowerCaseFilter?
>                 -- DM
>                 On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>
>                 DM, this isn't a bug.
>
>                 The arabic stopwords are not normalized.
>
>                 but for persian, i normalized the stopwords. mostly
>                 because i did not want to have to create variations
>                 with farsi yah versus arabic yah for each one.
>
>                 On Thu, Oct 8, 2009 at 7:24 AM, DM Smith
>                 <dmsmith555@gmail.com> wrote:
>
>                     I'm wondering if there is a bug in ArabicAnalyzer
>                     in 2.9. (I don't know Arabic or Farsi, but have
>                     some texts to index in those languages.)
>                     The tokenizer/filter chain for ArabicAnalyzer is:
>                             TokenStream result = new ArabicLetterTokenizer( reader );
>                             result = new StopFilter( result, stoptable );
>                             result = new LowerCaseFilter(result);
>                             result = new ArabicNormalizationFilter( result );
>                             result = new ArabicStemFilter( result );
>
>                             return result;
>
>                     Shouldn't the StopFilter come after
>                     ArabicNormalizationFilter?
>
>                     As a comparison the PersianAnalyzer has:
>                         TokenStream result = new ArabicLetterTokenizer(reader);
>                         result = new LowerCaseFilter(result);
>                         result = new ArabicNormalizationFilter(result);
>                         /* additional persian-specific normalization */
>                         result = new PersianNormalizationFilter(result);
>                         /*
>                          * the order here is important: the stopword
>                          * list is normalized with the above!
>                          */
>                         result = new StopFilter(result, stoptable);
>
>                         return result;
>
>
>                     Thanks,
>                     DM
>
>
>                 --
>                 Robert Muir
>                 rcmuir@gmail.com
>

