lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From esra <esraer...@gmail.com>
Subject RE: lucene farsi problem
Date Sat, 10 May 2008 17:22:41 GMT

Hi Steve,

i used the locale as "ar" and it works fine .

again thanks a lot for your help.

Esra


Steven A Rowe wrote:
> 
> Hi Esra,
> 
> On 05/06/2008 at 7:38 AM, esra wrote:
>> i tried the class and it works fine with the locale parameter "ar".
> 
> Cool, I'm glad this addressed your problem!
> 
>> Actually we are using "fa" for farsi and "ar" for arabic.
>> I have added a little control for the locale parameter in my
>> code and now i can see the correct results.
> 
> From what I could tell, the Collator available for Locale("fa") in the Sun
> 1.4.2 and 1.5.0 JDKs does not contain Farsi character collation, but the
> Collator available for Locale("ar") *does* contain Farsi collation.  I
> switched TestCollatingRangeQuery from Locale("fa") to Locale("ar") when I
> couldn't get the Collator returned for Farsi [ via
> Collator.getInstance(new Locale("fa") ] to produce correct results.
> 
> Did you find that Locale("fa") produces the correct results?  If so, which
> VM are you using?
> 
> At Chris Hostetter's suggestion, I am rewriting the patch attached to
> LUCENE-1279, including the following changes:
> 
> - Merged the contents of the CollatingRangeQuery class into RangeQuery and
> RangeFilter
> - Switched the Locale parameter to instead take an instance of Collator
> - Modified QueryParser.jj to construct a QueryParser class that can accept
> a range collator and pass it either to RangeQuery or through
> ConstantScoreRangeQuery to RangeFilter.
> 
> I plan on posting the revised patch in the next day or two.
> 
> Steve
> 
> On 05/06/2008 at 7:38 AM, esra wrote:
>> 
>> Hi Steven ,
>> Hi Steven,
>> 
>> i tried the class and it works fine with the locale parameter "ar".
>> 
>> Actually we are using "fa" for farsi and "ar" for arabic.
>> I have added a little control for the locale parameter in my
>> code and now i can see the correct results.
>> 
>> Thank you very much for ypur help.
>> 
>> Esra.
>> 
>> Steven A Rowe wrote:
>> > 
>> > Hi Esra,
>> > 
>> > I have attached a patch to LUCENE-1279 containing a new class:
>> > CollatingRangeQuery.
>> > 
>> > The patch also contains a test class: TestCollatingRangeQuery.  One of
>> > the test methods checks for the Farsi range you were having trouble
>> > with.
>> > 
>> > It should be mentioned that according to
>> > Collator.getAvailableLocales(), neither Java 1.4.2 nor Java 1.5.0
>> > contains Farsi collation information. However, in the test class I use
>> > the Arabic Locale, which seems to properly collate the non-Arabic Farsi
>> > letter U+0698, and hopefully behaves well with other Farsi letters.  If
>> > you find that this is not the case, you can look into writing collation
>> > rules using RuleBasedCollator - you should be able to directly specify
>> > the proper letter orderings for Farsi; CollatingRangeQuery would also
>> > have to supply a constructor that takes in a Collator instead of a
>> > Locale.
>> > 
>> > Please give the class a try and post back about how it works.
>> > 
>> > Thanks,
>> > Steve
>> > 
>> > On 05/03/2008 at 8:33 AM, esra wrote:
>> > > 
>> > > Hi Steven,
>> > > 
>> > > thanks for your help....
>> > > 
>> > > Esra
>> > > 
>> > > 
>> > > Steven A Rowe wrote:
>> > > > 
>> > > > Hi Esra,
>> > > > 
>> > > > I have created an issue for this - see
>> > > > <https://issues.apache.org/jira/browse/LUCENE-1279>.
>> > > > 
>> > > > I'll try to take a crack at a patch this weekend.
>> > > > 
>> > > > Steve
>> > > > 
>> > > > On 05/02/2008 at 12:55 PM, esra wrote:
>> > > > > 
>> > > > > Hi Steven ,
>> > > > > 
>> > > > > yes you are right, sorry i am a bit confused.
>> > > > > 
>> > > > > i checked again and the correct one is  "zhe"/U+698.
>> > > > > 
>> > > > > It seems the word is in the range but my customer says it
>> > > > > shouldn't be.
>> > > > > 
>> > > > > I think problem occurs because  "zhe" is a Persian letter
>> > > > > outside the Arabic
>> > > > > alphabet. In farsi alphabet this letter is not after the "س"
>> > > > > letter but it's
>> > > > > unicode is bigger than "س" letter's and the searcher works
>> > > > > with unicodes.
>> > > > > 
>> > > > > Esra
>> > > > > 
>> > > > > 
>> > > > > Steven A Rowe wrote:
>> > > > > > 
>> > > > > > Hi Esra,
>> > > > > > 
>> > > > > > You are *still* incorrectly referring to the glyph with
three
>> dots
>> > > > > > over it:
>> > > > > > 
>> > > > > > On 05/02/2008 at 12:18 PM, esra wrote:
>> > > > > > > yes the correct one is "ژ "/"ze"/U+632.
>> > > > > > 
>> > > > > > "ژ" is *not* "ze"/U+632 - it is "zhe"/U+698.
>> > > > > > 
>> > > > > > Have you increased the font size?  Can you see the difference
>> > > > > > between these two?:
>> > > > > > 
>> > > > > > "ژ"/"zhe"/U+698
>> > > > > > "ز"/"ze"/U+632
>> > > > > > 
>> > > > > > > my problem is when i do search for  "د-ژ" range.
>> The result is
>> > > "ساب
>> > > > > > > ووفر" and this word's first letter is "س" and
it's unicode is
>> > > > > > > "U+633" and it is not in the in the [ U+062F -
>> U+0632 ] range.
>> > > > > > 
>> > > > > > Like I keep saying, in the above description, you're
>> using the
>> > > glyph
>> > > > > > "ژ"/"zhe"/U+698, while calling at the same time incorrectly
>> > > > > > referring to it as "ze"/U+632.
>> > > > > > 
>> > > > > > I don't mean to continually bang on about this - if you're
>> *sure*
>> > > > > > that when you search, you're using the character represented
by
>> the
>> > > > > > glyph with one dot (and not three), i.e. "ز"/"ze"/U+632,
then
>> the
>> > > > > > problem lies elsewhere.
>> > > > > > 
>> > > > > > Steve
>> > > > > > 
>> > > > > > On 05/02/2008 at 12:18 PM, esra wrote:
>> > > > > > > yes the correct one is "ژ "/"ze"/U+632.
>> > > > > > > 
>> > > > > > > my problem is when i do search for  "  د-ژ" range.
The result
>> is 
>> > > > > > > ""ساب ووفر " and this word's first letter is
"س " and it's
>> unicode
>> > > > > > > is "U+633"  and  it is not in the in the [ U+062F -
U+0632 ]
>> range.
>> > > > > > > 
>> > > > > > > am i wrong?
>> > > > > > > 
>> > > > > > > Esra
>> > > > > > > 
>> > > > > > > Steven A Rowe wrote:
>> > > > > > > > 
>> > > > > > > > Hi Esra,
>> > > > > > > > 
>> > > > > > > > I still think you're wrong :).
>> > > > > > > > 
>> > > > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
>> > > > > > > > > > ژ = U+632
>> > > > > > > > 
>> > > > > > > > According to the website you linked to, the
>> above character,
>> > > which
>> > > > > > > > has three dots over it, is named "zhe", and its
>> > > Unicode code point
>> > > > > is
>> > > > > > > > U+698. (I had to increase the font size to see
the three
>> dots.)
>> > > > > > > > 
>> > > > > > > > I think you are confusing "ژ"/"zhe"/U+698 with
the letter
>> > > > > > > > "ز"/"ze"/U+632, which has just one dot over it.
>> > > > > > > > 
>> > > > > > > > Unless you were mistaken in all of your emails
when
>> > > you included
>> > > > > the
>> > > > > > > > character "ژ"/"zhe" instead of "ز"/"ze", then
what I said
>> in my
>> > > > > > > > previous email still stands: there is no problem
here.
>> > > > > > > > 
>> > > > > > > > Steve
>> > > > > > > > 
>> > > > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
>> > > > > > > > > 
>> > > > > > > > > Hi Steven,
>> > > > > > > > > 
>> > > > > > > > > sorry i made a mistake. unicodes are like
this:
>> > > > > > > > > 
>> > > > > > > > > > د=U+62F
>> > > > > > > > > > ژ = U+632
>> > > > > > > > > > and the first letter of "ساب ووفر
" is  س = U+633
>> > > > > > > > > 
>> > > > > > > > > you can also check them here
>> > > > > > > > > > 
>> > > > > http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html
>> > > > > > > > > 
>> > > > > > > > > Esra
>> > > > > > > > > 
>> > > > > > > > > 
>> > > > > > > > > Steven A Rowe wrote:
>> > > > > > > > > > 
>> > > > > > > > > > Hi Esra,
>> > > > > > > > > > 
>> > > > > > > > > > Going back to the original problem statement,
I
>> > > see something
>> > > > > that
>> > > > > > > > > > looks illogical to me - please correct
me if I'm wrong:
>> > > > > > > > > > 
>> > > > > > > > > > On Apr 30, 2008, at 3:21 AM, esra wrote:
>> > > > > > > > > > > i am using lucene's "IndexSearcher"
to search
>> > > the given xml
>> > > > > by
>> > > > > > > > > > > keyword which contains farsi information.
>> > > while searching i
>> > > > > use
>> > > > > > > > > > > ranges like
>> > > > > > > > > > > 
>> > > > > > > > > > > آ-ث  |  ج-خ  |  د-ژ  |  س-ظ
 |  ع-ق  |  ک-ل  |  م-ی
>> > > > > > > > > > > 
>> > > > > > > > > > > when i do search for  "د-ژ" 
range the results
>> > > are wrong ,
>> > > > > they
>> > > > > > > > > > > are the results of  " س-ظ "range.
>> > > > > > > > > > > 
>> > > > > > > > > > > for example when i do search for
"د-ژ"
>> one of the the
>> > > results
>> > > > > > > > > > > is "ساب ووفر", this result
also shown on the "
>> > > س-ظ " range's
>> > > > > result
>> > > > > > > > > > > list which is the corret range.
>> > > > > > > > > > > 
>> > > > > > > > > > > As IndexSearcher use "compareTo"
method
>> and this method
>> > > uses
>> > > > > > > > > > > unicodes for comparing, i found
the unicodes of the
>> characters.
>> > > > > > > > > > > 
>> > > > > > > > > > > د=U+62F
>> > > > > > > > > > > ژ = U+698
>> > > > > > > > > > > and the first letter of "ساب
ووفر " is  س = U+633
>> > > > > > > > > > 
>> > > > > > > > > > It appears to me that *both* the "د-ژ"
range [
>> > > > > U+062F - U+0698 ]
>> > > > > > > and
>> > > > > > > > > > the "س-ظ" range [ U+0633 - U+0638
] contain the
>> > > > > first letter of
>> > > > > > > "ساب
>> > > > > > > > > > ووفر", which is "س" = U+0633.
>> > > > > > > > > > 
>> > > > > > > > > > You stated that U+0633 should be contained
in the [
>> > > > > U+0633 - U+0638
>> > > > > > > ]
>> > > > > > > > > > range - I agree - but why do you think
U+0633 should
>> not be
>> > > > > > > > > > contained in the [ U+062F - U+0698 ]
range?
>> > > > > > > > > > 
>> > > > > > > > > > In other words, it looks to me like
your problem is
>> > > > > not a problem
>> > > > > > > at
>> > > > > > > > > > all.
>> > > > > > > > > > 
>> > > > > > > > > > Steve
>> > > > > > > > > > 
>> > > > > > > > > > 
>> > > > > > > > > 
>> > > > > > > > > -- View this message in context:
>> > > > > > > > > 
>> > > > > > > 
>> > > > > > >
>> http://www.nabble.com/lucene-farsi-problem-tp16977096p17019498
>> > > > > > > .html Sent
>> > > > > > > > from the Lucene - Java Users mailing list archive
at
>> Nabble.com.
>> > > > > > > > 
>> > > > > > > > 
>> > > > > > > > 
>> > > > > 
>> > > 
>> ---------------------------------------------------------------------
>> > > > > > > To
>> > > > > > > > unsubscribe, e-mail:
>> java-user-unsubscribe@lucene.apache.org
>> > > For
>> > > > > > > > additional commands, e-mail:
>> java-user-help@lucene.apache.org
>> > > > > > > > 
>> > > > > > > > 
>> > > > > > > 
>> > > > > > > 
>> > > > > > > 
>> > > > > > > 
>> > > > > > > 
>> > > > > > > 
>> > > > > > > -- View this message in context:
>> > > > > > 
>> > > http://www.nabble.com/lucene-farsi-problem-tp16977096p17022861.html
>> > > > > >  Sent from the Lucene - Java Users mailing list archive
at
>> > > > > Nabble.com.
>> > > > > > 
>> > > > > > 
>> > > > > > 
>> > > > > 
>> > > 
>> ---------------------------------------------------------------------
>> > > > > >  To unsubscribe, e-mail:
>> java-user-unsubscribe@lucene.apache.org
>> > > > > >  For additional commands, e-mail:
>> > > java-user-help@lucene.apache.org
>> > > > > > 
>> > > > > > 
>> > > > > > 
>> > > > > > 
>> > > > > > 
>> > > > > 
>> > > > > -- View this message in context:
>> > > > > 
>> > > http://www.nabble.com/lucene-farsi-problem-tp16977096p17023557
>> > > .html Sent
>> > > > from the Lucene - Java Users mailing list archive at Nabble.com.
>> > > > 
>> > > > 
>> > > > 
>> ---------------------------------------------------------------------
>> > > To
>> > > > unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
>> > > > additional commands, e-mail: java-user-help@lucene.apache.org
>> > > > 
>> > > > 
>> > > 
>> > > 
>> > > 
>> > > 
>> > > 
>> > > 
>> >  -- View this message in context:
>> >  http://www.nabble.com/lucene-farsi-problem-tp16977096p17034715.html
>> >  Sent from the Lucene - Java Users mailing list archive at
>> Nabble.com.
>> > 
>> > 
>> > 
>> ---------------------------------------------------------------------
>> >  To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >  For additional commands, e-mail: java-user-help@lucene.apache.org
>> > 
>> > 
>> > 
>> > 
>> > 
>> 
>> -- View this message in context:
>> http://www.nabble.com/lucene-farsi-problem-tp16977096p17080852.html Sent
>> from the Lucene - Java Users mailing list archive at Nabble.com.
>> 
>> 
>> --------------------------------------------------------------------- To
>> unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
>> additional commands, e-mail: java-user-help@lucene.apache.org
>> 
>>
> 
>  
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/lucene-farsi-problem-tp16977096p17165550.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message