lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!
Date Fri, 20 Feb 2009 12:03:36 GMT
It's been a few years since I've worked on Arabic, but it sounds  
reasonable.  Care to submit a patch with unit tests showing the  
StandardTokenizer properly handling all Arabic characters?  http://wiki.apache.org/lucene-java/HowToContribute


On Feb 20, 2009, at 6:22 AM, Yusuf Aaji wrote:

> Hi Everyone,
>
>
> My question is related to the arabic analysis package under:  
> org.apache.lucene.analysis.ar
>
>
> It is cool and it is doing a great job, but it uses a special  
> tokenizer: ArabicLetterTokenizer
>
>
> The problem with this tokenizer is that it fails to handle emails,  
> urls and acronyms the same way the StandardTokenizer does.
>
>
> Also the problem of the StandardTokenizer is that it fails to handle  
> arabic diacritics right. so it splits words which shouldn't be  
> splitted.
>
>
> Arabic diacritics are: (as mentioned in the class:  
> org.apache.lucene.analysis.ar.ArabicNormalizer)
>
>
> FATHATAN = '\u064B';
> DAMMATAN = '\u064C';
> KASRATAN = '\u064D';
> FATHA = '\u064E';
> DAMMA = '\u064F';
> KASRA = '\u0650';
> SHADDA = '\u0651';
> SUKUN = '\u0652';
>
>
> so it is the range [\u064B-\u0652]
>
>
> Is it possible to modify the StandardTokenizerImp to consider these  
> diacritics as normal letters.
>
>
> I guess it should be done the same way its is done for Chinese and  
> Japanese in this line in the file StandardTokenizerImp.jflex
>
>
> // Chinese and Japanese (but NOT Korean, which is included in  
> [:letter:])
>
> CJ         = [\u3100-\u312f\u3040-\u309F\u30A0-\u30FF\u31F0-\u31FF 
> \u3300-\u337f\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff65-\uff9f]
>
>
> so it can be something like:
>
> AR = [\u064B-\u0652]
>
>
> then modify this line also to include our new group of characters:
>
>
> // From the JFlex manual: "the expression that matches everything of  
> <a> not matched by <b> is !(!<a>|<b>)"
> LETTER     = !(![:letter:]|{CJ}|{AR})
>
>
>
> Am I right?! and am I going in the right direction?!! Comments are  
> very welcome.
>
>
> Regards..
>
>
> Yusuf
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message