lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steven A Rowe" <sar...@syr.edu>
Subject RE: lucene farsi problem
Date Mon, 05 May 2008 04:49:46 GMT
Hi Esra,

I have attached a patch to LUCENE-1279 containing a new class: CollatingRangeQuery.

The patch also contains a test class: TestCollatingRangeQuery.  One of the test methods checks
for the Farsi range you were having trouble with.

It should be mentioned that according to Collator.getAvailableLocales(), neither Java 1.4.2
nor Java 1.5.0 contains Farsi collation information.  However, in the test class I use the
Arabic Locale, which seems to properly collate the non-Arabic Farsi letter U+0698, and hopefully
behaves well with other Farsi letters.  If you find that this is not the case, you can look
into writing collation rules using RuleBasedCollator - you should be able to directly specify
the proper letter orderings for Farsi; CollatingRangeQuery would also have to supply a constructor
that takes in a Collator instead of a Locale.

Please give the class a try and post back about how it works.

Thanks,
Steve

On 05/03/2008 at 8:33 AM, esra wrote:
> 
> Hi Steven,
> 
> thanks for your help....
> 
> Esra
> 
> 
> Steven A Rowe wrote:
> > 
> > Hi Esra,
> > 
> > I have created an issue for this - see
> > <https://issues.apache.org/jira/browse/LUCENE-1279>.
> > 
> > I'll try to take a crack at a patch this weekend.
> > 
> > Steve
> > 
> > On 05/02/2008 at 12:55 PM, esra wrote:
> > > 
> > > Hi Steven ,
> > > 
> > > yes you are right, sorry i am a bit confused.
> > > 
> > > i checked again and the correct one is  "zhe"/U+698.
> > > 
> > > It seems the word is in the range but my customer says it
> > > shouldn't be.
> > > 
> > > I think problem occurs because  "zhe" is a Persian letter
> > > outside the Arabic
> > > alphabet. In farsi alphabet this letter is not after the "س"
> > > letter but it's
> > > unicode is bigger than "س" letter's and the searcher works
> > > with unicodes.
> > > 
> > > Esra
> > > 
> > > 
> > > Steven A Rowe wrote:
> > > > 
> > > > Hi Esra,
> > > > 
> > > > You are *still* incorrectly referring to the glyph with three dots
> > > > over it:
> > > > 
> > > > On 05/02/2008 at 12:18 PM, esra wrote:
> > > > > yes the correct one is "ژ "/"ze"/U+632.
> > > > 
> > > > "ژ" is *not* "ze"/U+632 - it is "zhe"/U+698.
> > > > 
> > > > Have you increased the font size?  Can you see the difference between
> > > > these two?:
> > > > 
> > > > "ژ"/"zhe"/U+698
> > > > "ز"/"ze"/U+632
> > > > 
> > > > > my problem is when i do search for  "د-ژ" range. The result is
 "ساب
> > > > > ووفر" and this word's first letter is "س" and it's unicode is
> > > > > "U+633" and it is not in the in the [ U+062F - U+0632 ] range.
> > > > 
> > > > Like I keep saying, in the above description, you're using the glyph
> > > > "ژ"/"zhe"/U+698, while calling at the same time incorrectly referring
> > > > to it as "ze"/U+632.
> > > > 
> > > > I don't mean to continually bang on about this - if you're *sure*
> > > > that when you search, you're using the character represented by the
> > > > glyph with one dot (and not three), i.e. "ز"/"ze"/U+632, then the
> > > > problem lies elsewhere.
> > > > 
> > > > Steve
> > > > 
> > > > On 05/02/2008 at 12:18 PM, esra wrote:
> > > > > yes the correct one is "ژ "/"ze"/U+632.
> > > > > 
> > > > > my problem is when i do search for  "  د-ژ" range. The result
> > > > > is  ""ساب ووفر
> > > > > " and this word's first letter is "س " and it's unicode is
> > > > > "U+633"  and  it
> > > > > is not in the in the [ U+062F - U+0632 ] range.
> > > > > 
> > > > > am i wrong?
> > > > > 
> > > > > Esra
> > > > > 
> > > > > Steven A Rowe wrote:
> > > > > > 
> > > > > > Hi Esra,
> > > > > > 
> > > > > > I still think you're wrong :).
> > > > > > 
> > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
> > > > > > > > ژ = U+632
> > > > > > 
> > > > > > According to the website you linked to, the above character,
which
> > > > > > has three dots over it, is named "zhe", and its
> Unicode code point
> > > is
> > > > > > U+698. (I had to increase the font size to see the three dots.)
> > > > > > 
> > > > > > I think you are confusing "ژ"/"zhe"/U+698 with the letter
> > > > > > "ز"/"ze"/U+632, which has just one dot over it.
> > > > > > 
> > > > > > Unless you were mistaken in all of your emails when
> you included
> > > the
> > > > > > character "ژ"/"zhe" instead of "ز"/"ze", then what I said
in my
> > > > > > previous email still stands: there is no problem here.
> > > > > > 
> > > > > > Steve
> > > > > > 
> > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
> > > > > > > 
> > > > > > > Hi Steven,
> > > > > > > 
> > > > > > > sorry i made a mistake. unicodes are like this:
> > > > > > > 
> > > > > > > > د=U+62F
> > > > > > > > ژ = U+632
> > > > > > > > and the first letter of "ساب ووفر " is  س
= U+633
> > > > > > > 
> > > > > > > you can also check them here
> > > > > > > > 
> > > http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html
> > > > > > > 
> > > > > > > Esra
> > > > > > > 
> > > > > > > 
> > > > > > > Steven A Rowe wrote:
> > > > > > > > 
> > > > > > > > Hi Esra,
> > > > > > > > 
> > > > > > > > Going back to the original problem statement, I
> see something
> > > that
> > > > > > > > looks illogical to me - please correct me if I'm wrong:
> > > > > > > > 
> > > > > > > > On Apr 30, 2008, at 3:21 AM, esra wrote:
> > > > > > > > > i am using lucene's "IndexSearcher" to search
> the given xml
> > > by
> > > > > > > > > keyword which contains farsi information.
> while searching i
> > > use
> > > > > > > > > ranges like
> > > > > > > > > 
> > > > > > > > > آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق
 |  ک-ل  |  م-ی
> > > > > > > > > 
> > > > > > > > > when i do search for  "د-ژ"  range the results
> are wrong ,
> > > they
> > > > > > > > > are the results of  " س-ظ "range.
> > > > > > > > > 
> > > > > > > > > for example when i do search for "د-ژ"  one
of the the results
> > > > > > > > > is "ساب ووفر", this result also shown
on the "
> س-ظ " range's
> > > result
> > > > > > > > > list which is the corret range.
> > > > > > > > > 
> > > > > > > > > As IndexSearcher use "compareTo" method and this
method uses
> > > > > > > > > unicodes for comparing, i found the unicodes
of the characters.
> > > > > > > > > 
> > > > > > > > > د=U+62F
> > > > > > > > > ژ = U+698
> > > > > > > > > and the first letter of "ساب ووفر " is
 س = U+633
> > > > > > > > 
> > > > > > > > It appears to me that *both* the "د-ژ" range [
> > > U+062F - U+0698 ]
> > > > > and
> > > > > > > > the "س-ظ" range [ U+0633 - U+0638 ] contain the
> > > first letter of
> > > > > "ساب
> > > > > > > > ووفر", which is "س" = U+0633.
> > > > > > > > 
> > > > > > > > You stated that U+0633 should be contained in the
[
> > > U+0633 - U+0638
> > > > > ]
> > > > > > > > range - I agree - but why do you think U+0633 should
not be
> > > > > > > > contained in the [ U+062F - U+0698 ] range?
> > > > > > > > 
> > > > > > > > In other words, it looks to me like your problem is
> > > not a problem
> > > > > at
> > > > > > > > all.
> > > > > > > > 
> > > > > > > > Steve
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > -- View this message in context:
> > > > > > > 
> > > > > http://www.nabble.com/lucene-farsi-problem-tp16977096p17019498
> > > > > .html Sent
> > > > > > from the Lucene - Java Users mailing list archive at Nabble.com.
> > > > > > 
> > > > > > 
> > > > > > 
> > > 
> ---------------------------------------------------------------------
> > > > > To
> > > > > > unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For
> > > > > > additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > -- View this message in context:
> > > > 
> http://www.nabble.com/lucene-farsi-problem-tp16977096p17022861.html
> > > >  Sent from the Lucene - Java Users mailing list archive at
> > > Nabble.com.
> > > > 
> > > > 
> > > > 
> > > 
> ---------------------------------------------------------------------
> > > >  To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >  For additional commands, e-mail:
> java-user-help@lucene.apache.org
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > > -- View this message in context:
> > > 
> http://www.nabble.com/lucene-farsi-problem-tp16977096p17023557
 .html Sent
> > from the Lucene - Java Users mailing list archive at Nabble.com.
> > 
> > 
> > --------------------------------------------------------------------- To
> > unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
> > additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> > 
> 
> 
> 
> 
> 
 
 --
 View this message in context: http://www.nabble.com/lucene-farsi-problem-tp16977096p17034715.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 
 ---------------------------------------------------------------------
 To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
 For additional commands, e-mail: java-user-help@lucene.apache.org

 

Mime
View raw message