lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steven A Rowe" <sar...@syr.edu>
Subject RE: lucene farsi problem
Date Fri, 02 May 2008 17:23:36 GMT
Hi Esra,

I have created an issue for this - see <https://issues.apache.org/jira/browse/LUCENE-1279>.

I'll try to take a crack at a patch this weekend.

Steve

On 05/02/2008 at 12:55 PM, esra wrote:
> 
> Hi Steven ,
> 
> yes you are right, sorry i am a bit confused.
> 
> i checked again and the correct one is  "zhe"/U+698.
> 
> It seems the word is in the range but my customer says it
> shouldn't be.
> 
> I think problem occurs because  "zhe" is a Persian letter
> outside the Arabic
> alphabet. In farsi alphabet this letter is not after the "س"
> letter but it's
> unicode is bigger than "س" letter's and the searcher works
> with unicodes.
> 
> Esra
> 
> 
> Steven A Rowe wrote:
> > 
> > Hi Esra,
> > 
> > You are *still* incorrectly referring to the glyph with three dots over
> > it:
> > 
> > On 05/02/2008 at 12:18 PM, esra wrote:
> > > yes the correct one is "ژ "/"ze"/U+632.
> > 
> > "ژ" is *not* "ze"/U+632 - it is "zhe"/U+698.
> > 
> > Have you increased the font size?  Can you see the difference between
> > these two?:
> > 
> > "ژ"/"zhe"/U+698
> > "ز"/"ze"/U+632
> > 
> > > my problem is when i do search for  "د-ژ" range. The result is  "ساب
> > > ووفر" and this word's first letter is "س" and it's unicode is "U+633"

> > > and it is not in the in the [ U+062F - U+0632 ] range.
> > 
> > Like I keep saying, in the above description, you're using the glyph
> > "ژ"/"zhe"/U+698, while calling at the same time incorrectly referring
> > to it as "ze"/U+632.
> > 
> > I don't mean to continually bang on about this - if you're *sure* that
> > when you search, you're using the character represented by the glyph
> > with one dot (and not three), i.e. "ز"/"ze"/U+632, then the problem
> > lies elsewhere.
> > 
> > Steve
> > 
> > On 05/02/2008 at 12:18 PM, esra wrote:
> > > yes the correct one is "ژ "/"ze"/U+632.
> > > 
> > > my problem is when i do search for  "  د-ژ" range. The result
> > > is  ""ساب ووفر
> > > " and this word's first letter is "س " and it's unicode is
> > > "U+633"  and  it
> > > is not in the in the [ U+062F - U+0632 ] range.
> > > 
> > > am i wrong?
> > > 
> > > Esra
> > > 
> > > Steven A Rowe wrote:
> > > > 
> > > > Hi Esra,
> > > > 
> > > > I still think you're wrong :).
> > > > 
> > > > On 05/02/2008 at 9:31 AM, esra wrote:
> > > > > > ژ = U+632
> > > > 
> > > > According to the website you linked to, the above character, which
> > > > has three dots over it, is named "zhe", and its Unicode code point is
> > > > U+698. (I had to increase the font size to see the three dots.)
> > > > 
> > > > I think you are confusing "ژ"/"zhe"/U+698 with the letter
> > > > "ز"/"ze"/U+632, which has just one dot over it.
> > > > 
> > > > Unless you were mistaken in all of your emails when you included the
> > > > character "ژ"/"zhe" instead of "ز"/"ze", then what I said in my
> > > > previous email still stands: there is no problem here.
> > > > 
> > > > Steve
> > > > 
> > > > On 05/02/2008 at 9:31 AM, esra wrote:
> > > > > 
> > > > > Hi Steven,
> > > > > 
> > > > > sorry i made a mistake. unicodes are like this:
> > > > > 
> > > > > > د=U+62F
> > > > > > ژ = U+632
> > > > > > and the first letter of "ساب ووفر " is  س = U+633
> > > > > 
> > > > > you can also check them here
> > > > > > 
> http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html
> > > > > 
> > > > > Esra
> > > > > 
> > > > > 
> > > > > Steven A Rowe wrote:
> > > > > > 
> > > > > > Hi Esra,
> > > > > > 
> > > > > > Going back to the original problem statement, I see something
that
> > > > > > looks illogical to me - please correct me if I'm wrong:
> > > > > > 
> > > > > > On Apr 30, 2008, at 3:21 AM, esra wrote:
> > > > > > > i am using lucene's "IndexSearcher" to search the given
xml by
> > > > > > > keyword which contains farsi information. while searching
i use
> > > > > > > ranges like
> > > > > > > 
> > > > > > > آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل
 |  م-ی
> > > > > > > 
> > > > > > > when i do search for  "د-ژ"  range the results are wrong
, they
> > > > > > > are the results of  " س-ظ "range.
> > > > > > > 
> > > > > > > for example when i do search for "د-ژ"  one of the the
results is
> > > > > > > "ساب ووفر", this result also shown on the " س-ظ
" range's result
> > > > > > > list which is the corret range.
> > > > > > > 
> > > > > > > As IndexSearcher use "compareTo" method and this method
uses
> > > > > > > unicodes for comparing, i found the unicodes of the characters.
> > > > > > > 
> > > > > > > د=U+62F
> > > > > > > ژ = U+698
> > > > > > > and the first letter of "ساب ووفر " is  س = U+633
> > > > > > 
> > > > > > It appears to me that *both* the "د-ژ" range [
> U+062F - U+0698 ]
> > > and
> > > > > > the "س-ظ" range [ U+0633 - U+0638 ] contain the
> first letter of
> > > "ساب
> > > > > > ووفر", which is "س" = U+0633.
> > > > > > 
> > > > > > You stated that U+0633 should be contained in the [
> U+0633 - U+0638
> > > ]
> > > > > > range - I agree - but why do you think U+0633 should not be
> > > > > > contained in the [ U+062F - U+0698 ] range?
> > > > > > 
> > > > > > In other words, it looks to me like your problem is
> not a problem
> > > at
> > > > > > all.
> > > > > > 
> > > > > > Steve
> > > > > > 
> > > > > > 
> > > > > 
> > > > > -- View this message in context:
> > > > > 
> > > http://www.nabble.com/lucene-farsi-problem-tp16977096p17019498
> > > .html Sent
> > > > from the Lucene - Java Users mailing list archive at Nabble.com.
> > > > 
> > > > 
> > > > 
> ---------------------------------------------------------------------
> > > To
> > > > unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
> > > > additional commands, e-mail: java-user-help@lucene.apache.org
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> >  -- View this message in context:
> >  http://www.nabble.com/lucene-farsi-problem-tp16977096p17022861.html
> >  Sent from the Lucene - Java Users mailing list archive at
> Nabble.com.
> > 
> > 
> > 
> ---------------------------------------------------------------------
> >  To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >  For additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> > 
> > 
> > 
> > 
> 
> -- View this message in context:
> http://www.nabble.com/lucene-farsi-problem-tp16977096p17023557.html Sent
> from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 
> --------------------------------------------------------------------- To
> unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
> additional commands, e-mail: java-user-help@lucene.apache.org
> 
>

 

Mime
View raw message