lucene-java-user mailing list archives

From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: Morphological Search Problem
Date Mon, 02 Apr 2007 12:55:34 GMT
Have you used Luke to see what is actually in the index?  Or written  
some test cases for your analyzer to know that the appropriate tokens  
are coming out of your analyzer?
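
As a rough illustration of the kind of test case meant here, the sketch below checks that an indexed term and a query term reduce to the same token. It deliberately does not use the Lucene classes: `AnalyzerCheck`, `toyAnalyze`, and the two suffix rules are hypothetical stand-ins, and the suffix stripping is not a real stemmer. In an actual test you would collect the terms from your own analyzer's token stream in place of `toyAnalyze`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AnalyzerCheck {
    // Hypothetical stand-in for the real analyzer: lowercases, splits on
    // whitespace, and strips two English suffixes. NOT a real stemmer.
    public static List<String> toyAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields one empty string from split
            }
            tokens.add(stripSuffix(raw));
        }
        return tokens;
    }

    // Removes "ing" or "ed" when at least three characters would remain.
    static String stripSuffix(String word) {
        for (String suffix : new String[] {"ing", "ed"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // True when the indexed text and the query text share at least one token,
    // which is the condition a term query needs in order to hit.
    public static boolean wouldMatch(String indexedText, String queryText) {
        List<String> indexed = toyAnalyze(indexedText);
        for (String queryToken : toyAnalyze(queryText)) {
            if (indexed.contains(queryToken)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(toyAnalyze("test testing tested"));
        System.out.println(wouldMatch("test", "testing"));
    }
}
```

The useful part is the shape of the check: run the same analysis over the indexed text and the query text, then compare the tokens. If the two sides go through different chains, a mismatch shows up here before it shows up in search results.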

Also, could you give more details about the filters you are using?  I  
am not familiar w/ ExactTokensConstructorFilter, etc.

The formatting is a little hard to read, but I think it says you are  
passing the ArabicStemmer to the SnowballFilter, correct?  I assume  
you are dealing w/ mixed content, correct?  That is, you have Arabic  
and English in the same token stream? I know when I was working on  
our Arabic/English project, we had to be careful about mixed content  
like this.
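
One way to be careful with mixed content is to decide per token which stemmer should see it. The sketch below only shows the script detection, using the JDK's `Character.UnicodeScript`; the `ScriptRouter` class and its `Route` enum are hypothetical names, and how the result is wired into a filter chain is left open. It is an assumption about one possible approach, not a description of what our project did.

```java
import java.lang.Character.UnicodeScript;

public class ScriptRouter {
    public enum Route { ARABIC, LATIN, OTHER }

    // Classifies a token by the scripts of its code points. Tokens that mix
    // Arabic and Latin letters, or contain neither, are routed to OTHER so
    // that no stemmer is applied to them by mistake.
    public static Route route(String token) {
        int arabic = 0;
        int latin = 0;
        for (int i = 0; i < token.length(); ) {
            int codePoint = token.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(codePoint);
            if (script == UnicodeScript.ARABIC) {
                arabic++;
            } else if (script == UnicodeScript.LATIN) {
                latin++;
            }
            i += Character.charCount(codePoint);
        }
        if (arabic > 0 && latin == 0) {
            return Route.ARABIC;
        }
        if (latin > 0 && arabic == 0) {
            return Route.LATIN;
        }
        return Route.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(route("testing"));
        // Arabic word written with Unicode escapes to avoid encoding issues
        System.out.println(route("\u0643\u062a\u0627\u0628"));
        System.out.println(route("2007"));
    }
}
```

With a check like this, an English token never reaches the Arabic stemmer and an Arabic token never reaches the English Snowball stemmer, so neither stemmer can mangle the other language's tokens.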



On Apr 2, 2007, at 7:57 AM, Shaimaa Mohamed wrote:

> Dear all,
>
> We are using a Unified Analyzer as the Lucene analyzer so that we can
> index and search both Arabic and English documents.
>
> Here is the code:
>
>
>
> public TokenStream tokenStream(String FieldName, Reader reader)
> {
>     switch (analysisMode) {
>         case UNIFIED:
>             return new ExactTokensContructorFilter(
>                 new SnowballFilter(
>                     new ArabicStemmer(
>                         new ExactTokensSpecifierFilter(
>                             getStandardAnalyzerStream(reader)),
>                         false, false),
>                     latinLanguage));
>         case EXACT:
>             return new ExactTokensContructorFilter(
>                 new ExactTokensSpecifierFilter(
>                     getStandardAnalyzerStream(reader)));
>     }
>     return null;
> }
>
>
>
> But the problem is that the results of the morphological search in
> English and Arabic are not good. For example:
>
> The data I search contains "test", "testing" and "tested". When I
> search for "testing", "test" does not appear in the search results,
> although when I traced it I found that the tokens of "testing"
> contain "test". But when I search for "manage", I get "management"
> in the search results, which is correct. So what's the difference
> between the two cases?
>
>
>
> Besides that, I tried using only the Snowball Analyzer instead of the
> Unified Analyzer and ran the same test, and this time it gave correct,
> good results!
>
> So can anyone help explain why using the Unified Analyzer affects the
> results?
>
>
>
> Note: latinLanguage in the above code = "English"
>
>
>
> Thanks & Best Regards,
>
> ------------------------------------
> Shaimaa Mohamed
> Team Leader
> ICT Department
> Bibliotheca Alexandrina
> P.O. Box 138, Chatby
> Alexandria 21526, Egypt
> Tel: +(203) 483 9999, Ext: 1418
> Fax: +(203) 482 0405
> Email: Shaimaa.Mohamed@bibalex.org
> Web Site: www.bibalex.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

