lucene-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)
Date Thu, 02 Oct 2008 02:31:20 GMT
Can we have the Hebrew discussion on another thread?  FWIW, I do agree  
it would be a good thing to add.

Thanks,
Grant

On Oct 1, 2008, at 4:02 PM, Nadav Har'El wrote:

> On Tue, Sep 30, 2008, Robert Muir wrote about "Re: [jira] Commented:  
> (LUCENE-1406) new Arabic Analyzer (Apache license)":
>> Thanks for the clarification. With this method the Arabic analyzer could
>> lemmatize, not stem, using the Buckwalter dictionary, and things like
>> broken plurals will work correctly.
>>
>> I'm not sure yet whether Hspell has this type of information, but it would
>> at least be a better stem for Hebrew as well.
>
> Indeed Hspell also has this information. You can see for example
> http://www.cs.technion.ac.il/~danken/cgi-bin/hspell.cgi?text=%E4%F8%EB%E1%FA&ling=on
> (but you'll need to be able to read Hebrew to understand what this means).
>
> But one thing to remember is that if you use Hspell, or basically any other
> dictionary, you are committing yourself to a particular vocabulary and a
> particular spelling of it. If your stemmer comes across a word outside your
> vocabulary, or spelled a bit differently, it won't know what to do with it.
>
> This problem is particularly visible in Hebrew, because its unvowelled
> spelling standard (defined by the Academy of the Hebrew Language) is
> not very well known. When I was in school, twenty years ago, it wasn't
> even mentioned, let alone taught! As a result, some words have a few
> spelling variants in the wild, with each dictionary typically considering
> one correct and the others misspellings.
>
> -- 
> Nadav Har'El                        |Wednesday, Oct  1 2008, 3 Tishri 5769
> IBM Haifa Research Lab              |-----------------------------------------
>                                     |The two most common elements in the
> http://nadav.harel.org.il           |universe are hydrogen and stupidity.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
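
The out-of-vocabulary limitation described above can be sketched in a few
lines. This is only a minimal illustration, not the API of Hspell, the
Buckwalter dictionary, or Lucene: the lexicon entries and the function names
lemmatize and lemmatize_or_keep are hypothetical, and English words stand in
for Hebrew or Arabic forms.

```python
from typing import Optional

# Toy lexicon mapping surface forms to lemmas. These entries are invented for
# illustration; they are not taken from Hspell or the Buckwalter dictionary.
LEXICON = {
    "mice": "mouse",      # irregular plural, resolved only because it is listed
    "mouse": "mouse",
    "colours": "colour",
    "colour": "colour",
}


def lemmatize(token: str) -> Optional[str]:
    """Return the lemma for a listed token, or None if it is out of vocabulary."""
    return LEXICON.get(token.lower())


def lemmatize_or_keep(token: str) -> str:
    """Fall back to the lowercased token when the lexicon has no entry."""
    lemma = lemmatize(token)
    return lemma if lemma is not None else token.lower()
```

Here "colors" stands in for a spelling variant the lexicon does not list:
lemmatize("colors") returns None, so the caller has to pick a fallback. With
the fallback shown, the variant is indexed unchanged as "colors" while
"colours" is indexed as "colour", so the two spellings will not match at
search time.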




