DM, by the way, if you want this lowercasing behavior with edge cases, check out LUCENE-1488. There is a case folding filter there, as well as a normalization filter, and they interact correctly for what you want :)
Robert,
That's my understanding too. But there has got to be a way to provide it w/o duplication of code.
it's my understanding that contrib/analyzers should not have any external dependencies,
so it could be eons before the jdk exposes these things
I'm using ICU now for that very reason. It takes too long for the JDK to be current on anything, let alone something that Java boasted of in the early days.
So I don't know what to do. It would be nice if things like ArabicAnalyzer handled Greek edge cases correctly, don't you think?
On Thu, Oct 8, 2009 at 11:38 AM, Robert Muir <firstname.lastname@example.org> wrote:
I'm suggesting that if I know my input document well and know that it has mixed text and that the text is Arabic and one other known language that I might want to augment the stop list with stop words appropriate for that known language. I think that in this case, stop filter should be after lower case filter.
I think this is a good idea?
As to lowercasing across the board, I also think it is pretty safe. But I think there are some edge cases. For example, lowercasing an all-upper-case Greek word that ends in sigma does not produce the same result as the same word written in lower case. The properly written lowercase word ends in a final sigma (ς), whereas lowercasing the capital sigma (Σ) yields a regular small sigma (σ). For Greek, using an UpperCaseFilter followed by a LowerCaseFilter would handle this case.
or you could use unicode case folding. lowercasing is for display purposes, not search.
IMHO, this is not an issue for the Arabic or Persian analyzers.
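To make the final-sigma edge case concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class and method names are made up for illustration, and the per-character lowercasing via Character.toLowerCase only approximates what a lowercasing token filter does. The upper-then-lower pass mimics the UpperCaseFilter/LowerCaseFilter idea above and is a rough stand-in for real Unicode case folding, not a replacement for it.

```java
// Hypothetical illustration only; not part of Lucene.
class SigmaSketch {
    // Lowercase each char independently (no knowledge of word position).
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Uppercase then lowercase each char, so final sigma (U+03C2)
    // and small sigma (U+03C3) both end up as U+03C3.
    static String upperThenLower(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(Character.toUpperCase(s.charAt(i))));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = "\u039F\u0394\u039F\u03A3"; // all caps, ends in capital sigma
        String lower = "\u03BF\u03B4\u03BF\u03C2"; // lower case, ends in final sigma

        // Plain lowercasing leaves the two spellings different.
        System.out.println(lowerPerChar(upper).equals(lowerPerChar(lower))); // false

        // Upper-then-lower maps both spellings to the same token.
        System.out.println(upperThenLower(upper).equals(upperThenLower(lower))); // true
    }
}
```

The point is only that the two spellings of the same word index as different terms unless something folds the sigmas together.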
On 10/08/2009 09:36 AM, Robert Muir wrote:
DM, i suppose. but this is a tricky subject, what if you have mixed Arabic / German or something like that?
for some other languages written in the Latin script, English stopwords could be bad :)
I think that lowercasing non-Arabic (also Cyrillic, etc.) is pretty safe across the board though.
On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <email@example.com> wrote:
On 10/08/2009 09:23 AM, Uwe Schindler wrote:
I learned something new!
Just an addition: The lowercase filter is only for the case of embedded
non-arabic words. And these will not appear in the stop words.
Hmm. If one has a mixed Arabic / English text, shouldn't one be able to augment the stopwords list with English stop words? And if so, shouldn't the stop filter come after the lower case filter?
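The ordering question can be sketched without Lucene at all. The snippet below is a toy model: plain Java lists stand in for the token stream, the stop set and the sample tokens are invented, and the class and method names are hypothetical. It only shows why a case-sensitive stop set misses capitalized English words when stop filtering runs before lowercasing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the Lucene filter API.
class FilterOrderSketch {
    // Assumed stop list: lowercase English entries added by the user.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "and"));

    // Stand-in for a lowercasing filter.
    static List<String> lower(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Stand-in for a case-sensitive stop filter.
    static List<String> stop(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "kitab", "AND", "book");

        // Stop first, then lowercase: "The" and "AND" slip through.
        System.out.println(lower(stop(tokens))); // [the, kitab, and, book]

        // Lowercase first, then stop: both are removed.
        System.out.println(stop(lower(tokens))); // [kitab, book]
    }
}
```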
From: Basem Narmok [mailto:firstname.lastname@example.org]
Sent: Thursday, October 08, 2009 4:20 PM
Subject: Re: Arabic Analyzer: possible bug
DM, there are no upper/lower cases in Arabic, so don't worry, but the
stop word list needs some corrections and may miss some common/stop
On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <email@example.com> wrote:
Thanks for the info.
As I said, I am illiterate in Arabic. So I have another, perhaps
Does the stop word list have every combination of upper/lower case for
Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
DM, this isn't a bug.
The arabic stopwords are not normalized.
but for persian, i normalized the stopwords. mostly because i did not want
to have to create variations with farsi yah versus arabic yah for each one.
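As a rough illustration of why normalizing first saves listing variants, here is a small plain-Java sketch. It assumes the normalization simply maps Farsi yeh (U+06CC) onto Arabic yeh (U+064A); the real PersianNormalizationFilter may well do more than that. The class name, the helper methods and the one-entry stop set are all hypothetical.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical illustration only; not the Lucene normalizer.
class YehNormalizationSketch {
    static final char FARSI_YEH = '\u06CC';
    static final char ARABIC_YEH = '\u064A';

    // Assumed normalization: fold Farsi yeh into Arabic yeh.
    static String normalize(String s) {
        return s.replace(FARSI_YEH, ARABIC_YEH);
    }

    // Stop list stored once, in normalized form (spelled with Arabic yeh).
    static final Set<String> STOP = Collections.singleton("\u0627\u064A\u0646");

    // Normalize the token before the stop lookup, as in the Persian chain.
    static boolean isStopWord(String token) {
        return STOP.contains(normalize(token));
    }

    public static void main(String[] args) {
        String withFarsiYeh = "\u0627\u06CC\u0646";
        String withArabicYeh = "\u0627\u064A\u0646";

        // Both spellings hit the single normalized entry.
        System.out.println(isStopWord(withFarsiYeh));  // true
        System.out.println(isStopWord(withArabicYeh)); // true
    }
}
```

Without the normalization step the stop list would need one entry per spelling.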
On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <firstname.lastname@example.org> wrote:
I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
Arabic or Farsi, but have some texts to index in those languages.)
The tokenizer/filter chain for ArabicAnalyzer is:
TokenStream result = new ArabicLetterTokenizer( reader );
result = new StopFilter( result, stoptable );
result = new LowerCaseFilter(result);
result = new ArabicNormalizationFilter( result );
result = new ArabicStemFilter( result );
Shouldn't the StopFilter come after ArabicNormalizationFilter?
As a comparison the PersianAnalyzer has:
TokenStream result = new ArabicLetterTokenizer(reader);
result = new LowerCaseFilter(result);
result = new ArabicNormalizationFilter(result);
/* additional persian-specific normalization */
result = new PersianNormalizationFilter(result);
/* the order here is important: the stopword list is normalized */
result = new StopFilter(result, stoptable);