lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 16:37:51 GMT
On 10/08/2009 11:46 AM, Robert Muir wrote:
> DM by the way, if you want this lowercasing behavior with edge cases, 
> check out LUCENE-1488. There is a case folding filter there, as well 
> as a normalization filter, and they interact correctly for what you 
> want :)
Robert,

So cool. I've been following the emails on this to java-dev that JIRA 
puts out, but I had not looked at the patch till now. Brought tears to 
my eyes.

How ready is it? I'd like to use it if it is "good enough".

BTW, does it handle the case where ' (an apostrophe) is used as a 
character in some languages? (IIRC in some African languages it is a 
whistle.) That is, do you know whether ICU will consider the context of 
adjacent characters in determining whether something is a word break?

>
> it's my understanding that contrib/analyzers should not have any 
> external dependencies,
That's my understanding too. But there has got to be a way to provide it 
w/o duplication of code.

> so it could be eons before the jdk exposes these things
I'm using ICU now for that very reason. It takes too long for the JDK to 
be current on anything, let alone something that Java boasted of in the 
early days.


> , so I don't know what to do. It would be nice if things like 
> ArabicAnalyzer handled greek edge cases correctly, don't you think?

I do think so. Maybe in the new package (org.apache.lucene.icu) have a 
subpackage analyzer that's dependent on contrib/analyzers. Or create a 
PluggableAnalyzer that one could supply a Tokenizer and an ordered list 
of Filters, changing the contrib/analyzers to derive from it. Or, use 
reflection to bring in the ICU ability if the lucene-icu.jar is present. 
Or, ...
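
A rough sketch of the PluggableAnalyzer idea, in plain Java rather than against the Lucene API. Everything here is hypothetical: `PluggableChain`, `with`, `letterTokens`, `lowerCase` and `stopFilter` are illustrative names, and tokens are plain strings instead of a TokenStream. It only shows the shape: one tokenizer plus an ordered list of filters that the caller supplies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical sketch, not Lucene API: one tokenizer plus an ordered list of filters.
final class PluggableChain {
    private final Function<String, List<String>> tokenizer;
    private final List<UnaryOperator<List<String>>> filters = new ArrayList<>();

    PluggableChain(Function<String, List<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    // Appends a filter; filters run in the order they were added.
    PluggableChain with(UnaryOperator<List<String>> filter) {
        filters.add(filter);
        return this;
    }

    List<String> analyze(String text) {
        List<String> tokens = tokenizer.apply(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    // Stand-in tokenizer: splits on anything that is not a Unicode letter.
    static List<String> letterTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Stand-in lowercasing filter.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Stand-in stop filter built from a caller-supplied stop set.
    static UnaryOperator<List<String>> stopFilter(Set<String> stopWords) {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                if (!stopWords.contains(t)) {
                    out.add(t);
                }
            }
            return out;
        };
    }
}
```

With something like this, a language-specific analyzer would just be a default list of filters, and a caller could reorder or replace entries instead of copying the whole class.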

Right now, for each of the contrib/analyzers I have my own copy that 
mimics them but doesn't use the StandardAnalyzer/StandardFilter (I think 
I want to use LUCENE-1488), does NFKC normalization, optionally uses a 
StopFilter (sometimes it is hard to dig out the stop set from the 
analyzers), and optionally uses a stemmer (Snowball, if available). 
Basically, I like all the parts that were provided by a 
contrib/analyzer, but I have different requirements than how those parts 
were packaged by the contrib/analyzer's Analyzer. (Thus my question on 
the order of filters in ArabicAnalyzer).
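
To make the ordering question concrete, here is a small sketch in plain Java, not Lucene code. `FilterOrderSketch`, `fold`, `stopThenFold` and `foldThenStop` are made-up names. The fold is NFKC normalization followed by the upper-then-lower trick mentioned for the Greek sigma case quoted further down, applied one code point at a time; it is only a stand-in for real Unicode case folding, which the JDK does not expose directly.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: the same stop set gives different results
// depending on whether it is applied before or after folding.
final class FilterOrderSketch {

    // NFKC-normalize, then uppercase and lowercase each code point.
    // Both sigma forms end up as the small (non-final) sigma.
    static String fold(String token) {
        String nfkc = Normalizer.normalize(token, Normalizer.Form.NFKC);
        StringBuilder sb = new StringBuilder(nfkc.length());
        for (int i = 0; i < nfkc.length(); ) {
            int cp = nfkc.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(Character.toUpperCase(cp)));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Stop filter first: tokens are compared to the stop set as written.
    static List<String> stopThenFold(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(fold(t));
            }
        }
        return out;
    }

    // Fold first: the stop set is assumed to be stored in folded form.
    static List<String> foldThenStop(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String folded = fold(t);
            if (!stopWords.contains(folded)) {
                out.add(folded);
            }
        }
        return out;
    }
}
```

For the tokens "The" and "book" and a stop set containing only "the", `stopThenFold` keeps both tokens while `foldThenStop` drops "The". That is the reason an augmented stop list for embedded non-Arabic words would want the stop filter after the lowercasing step.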

It'd really be nice if there were a way to specify that "tool chain". 
Ideally, I'd like to get the default chain, and modify it. (And I'd like 
to store a description of that tool chain with the index, with version 
info for each of the parts, so that I can tell when an index needs to be 
rebuilt.)
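
One possible shape for such a description, again as a hypothetical sketch: `ChainDescriptor`, `stage`, `describe` and `needsRebuild` are invented names, the version strings are illustrative, and how the string would be stored alongside the index is left open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an ordered "name=version" record of the tool chain
// that can be compared against whatever was saved when the index was built.
final class ChainDescriptor {
    private final List<String> stages = new ArrayList<>();

    // Records one tokenizer or filter, in chain order.
    ChainDescriptor stage(String name, String version) {
        stages.add(name + "=" + version);
        return this;
    }

    // Serialized form; the order of stages is part of the description.
    String describe() {
        return String.join(";", stages);
    }

    // True when the stored description differs in any stage, version, or order.
    boolean needsRebuild(String storedDescription) {
        return !describe().equals(storedDescription);
    }
}
```

Because the order is part of the string, moving the StopFilter from before LowerCaseFilter to after it would be flagged the same way as upgrading one of the parts.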

-- DM

>
> On Thu, Oct 8, 2009 at 11:38 AM, Robert Muir <rcmuir@gmail.com> wrote:
>
>         I'm suggesting that if I know my input document well and know
>         that it has mixed text and that the text is Arabic and one
>         other known language that I might want to augment the stop
>         list with stop words appropriate for that known language. I
>         think that in this case, stop filter should be after lower
>         case filter.
>
>      I think this is a good idea?
>
>
>         As to lower casing across the board, I also think it is pretty
>         safe. But I think there are some edge cases. For example,
>         lowercasing a Greek word in all upper case ending in sigma
>         will not produce the same as lower casing the same Greek word
>         in all lower case. The Greek word should have a final sigma
>         rather than a small sigma. For Greek, using an UpperCaseFilter
>         followed by a LowerCaseFilter would handle this case.
>
>     or you could use unicode case folding. lowercasing is for display
>     purposes, not search.
>
>
>         IMHO, this is not an issue for the Arabic or Persian analyzers.
>
>         -- DM
>
>
>         On 10/08/2009 09:36 AM, Robert Muir wrote:
>>         DM, i suppose. but this is a tricky subject, what if you have
>>         mixed Arabic / German or something like that?
>>
>>         for some other languages written in the Latin script, English
>>         stopwords could be bad :)
>>
>>         I think that Lowercasing non-Arabic (also cyrillic, etc), is
>>         pretty safe across the board though.
>>
>>         On Thu, Oct 8, 2009 at 9:29 AM, DM Smith
>>         <dmsmith555@gmail.com> wrote:
>>
>>             On 10/08/2009 09:23 AM, Uwe Schindler wrote:
>>
>>                 Just an addition: The lowercase filter is only for
>>                 the case of embedded
>>                 non-arabic words. And these will not appear in the
>>                 stop words.
>>
>>             I learned something new!
>>
>>             Hmm. If one has a mixed Arabic / English text, shouldn't
>>             one be able to augment the stopwords list with English
>>             stop words? And if so, shouldn't the stop filter come
>>             after the lower case filter?
>>
>>             -- DM
>>
>>
>>                     -----Original Message-----
>>                     From: Basem Narmok [mailto:narmok@gmail.com]
>>                     Sent: Thursday, October 08, 2009 4:20 PM
>>                     To: java-dev@lucene.apache.org
>>                     Subject: Re: Arabic Analyzer: possible bug
>>
>>                     DM, there are no upper/lower cases in Arabic, so
>>                     don't worry, but the
>>                     stop word list needs some corrections and may
>>                     miss some common/stop
>>                     Arabic words.
>>
>>                     Best,
>>
>>                     On Thu, Oct 8, 2009 at 4:14 PM, DM Smith
>>                     <dmsmith555@gmail.com> wrote:
>>
>>                         Robert,
>>                         Thanks for the info.
>>                         As I said, I am illiterate in Arabic. So I
>>                         have another, perhaps nonsensical, question:
>>                         Does the stop word list have every
>>                         combination of upper/lower case for each
>>                         Arabic word in the list? (i.e. is it fully
>>                         de-normalized?) Or should it come
>>                         after LowerCaseFilter?
>>                         -- DM
>>                         On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>>
>>                         DM, this isn't a bug.
>>
>>                         The arabic stopwords are not normalized.
>>
>>                         but for persian, i normalized the stopwords.
>>                         mostly because i did not want
>>                         to have to create variations with farsi yah
>>                         versus arabic yah for each one.
>>
>>                         On Thu, Oct 8, 2009 at 7:24 AM, DM Smith
>>                         <dmsmith555@gmail.com> wrote:
>>
>>                             I'm wondering if there is a bug in
>>                             ArabicAnalyzer in 2.9. (I don't know
>>                             Arabic or Farsi, but have some texts to
>>                             index in those languages.)
>>                             The tokenizer/filter chain for
>>                             ArabicAnalyzer is:
>>                                 TokenStream result = new ArabicLetterTokenizer(reader);
>>                                 result = new StopFilter(result, stoptable);
>>                                 result = new LowerCaseFilter(result);
>>                                 result = new ArabicNormalizationFilter(result);
>>                                 result = new ArabicStemFilter(result);
>>
>>                                 return result;
>>
>>                             Shouldn't the StopFilter come after
>>                             ArabicNormalizationFilter?
>>
>>                             As a comparison the PersianAnalyzer has:
>>                                 TokenStream result = new ArabicLetterTokenizer(reader);
>>                                 result = new LowerCaseFilter(result);
>>                                 result = new ArabicNormalizationFilter(result);
>>                                 /* additional persian-specific normalization */
>>                                 result = new PersianNormalizationFilter(result);
>>                                 /*
>>                                  * the order here is important: the stopword list is
>>                                  * normalized with the above!
>>                                  */
>>                                 result = new StopFilter(result, stoptable);
>>
>>                                 return result;
>>
>>
>>                             Thanks,
>>                             DM
>>
>>
>>                         --
>>                         Robert Muir
>>                         rcmuir@gmail.com
>>
>
>
>
>
>     -- 
>     Robert Muir
>     rcmuir@gmail.com
>
>
>
>
> -- 
> Robert Muir
> rcmuir@gmail.com

