lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From esra <esraer...@gmail.com>
Subject RE: lucene farsi problem
Date Thu, 01 May 2008 07:08:10 GMT

Hi Steve,

thanks for your reply , i know farsi is written and read right-to-left.
i am using RangeOuery class and it's rewrite(IndexReader reader) method
decides if the word is in range or not by compareTo method and this decision
is made by using unicodes.

while searching for "د-ژ" range the lowerTerm is "د" and  the upperTerm is
"ژ". 
And while comparing for the result "ساب ووفر" also takes the first letter as
س and does the comparison for this letter.

 د=U+62F
 ژ = U+698
 and the first letter of "ساب ووفر " is  س = U+633

Esra,


Steven A Rowe wrote:
> 
> Hi Esra,
> 
> Caveat: I don't speak, read, write, or dream in Farsi - I just know that
> it mostly shares its orthography with Arabic, and that they are both
> written and read right-to-left.
> 
> How are you constructing the queries?  Using QueryParser?  If so, then I
> suspect the problem is that you intend the range you supply to be read
> entirely right-to-left, but Lucene instead reads it left-to-right.  Have
> you tried using e.g. "د-ژ" instead of "د-ژ"?  (That is, placing the lower
> valued term on the left instead of the right.)
> 
> AFAICT, RangeFilter (called from ConstantScoreRangeQuery, which is called
> from QueryParser) does not test whether lowerTerm is in fact lower than
> upperTerm.  If it turns out that the problem is simply one of order, it
> might make sense to modify RangeFilter so that it flip them when lowerTerm
> > upperTerm.
> 
> Steve
> 
> On 04/30/2008 at 3:21 AM, esra wrote:
>> 
>> hi,
>> 
>> i am using lucene's "IndexSearcher" to search the given xml
>> by keyword which
>> contains farsi information.
>> while searching i use ranges like
>> 
>> آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی
>> 
>> when i do search for  "د-ژ"  range the results are wrong ,
>> they are the
>> results of  " س-ظ "range.
>> 
>> for example when i do search for "د-ژ"  one of the the
>> results is "ساب ووفر"
>> , this result also shown on the " س-ظ " range's result list
>> which is the
>> corret range.
>> 
>> As IndexSearcher use "compareTo" method and this method uses
>> unicodes for
>> comparing, i found the unicodes of the characters.
>> 
>> د=U+62F
>> ژ = U+698
>> and the first letter of "ساب ووفر " is  س = U+633
>> 
>> Do you have any idea how to solve this problem, there are
>> analyzers for
>> different languages ,
>> will this be usefull if so do you know where to find a farsi analyzer?
>> 
>> I would bu glad if you help.
>> 
>> thanks ,
>> 
>> Esra
>> 
>> -- View this message in context:
>> http://www.nabble.com/lucene-farsi-problem-tp16977096p16977096.html Sent
>> from the Lucene - Java Users mailing list archive at Nabble.com.
>> 
>> 
>> --------------------------------------------------------------------- To
>> unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
>> additional commands, e-mail: java-user-help@lucene.apache.org
>> 
>>
> 
>  
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/lucene-farsi-problem-tp16977096p16993041.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message