lucene-java-user mailing list archives

From "Steven A Rowe" <sar...@syr.edu>
Subject RE: lucene farsi problem
Date Sun, 11 May 2008 18:29:35 GMT
Hi Esra,

Did you try the new version of the patch?

In the latest version, I have moved the code that was in CollatingRangeQuery into
RangeQuery.

I also put the same functionality into RangeFilter, and provided code to call it from ConstantScoreRangeQuery
and QueryParser.  Note that ConstantScoreRangeQuery doesn't have the clause limit restriction
that RangeQuery has (1024 max clauses, IIRC).
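
To make concrete what handing a Collator to a range check changes, here is a minimal JDK-only sketch. It is not the code from the LUCENE-1279 patch: `inCollatedRange` is a hypothetical helper that only mirrors the idea of testing each term against the range bounds with `Collator.compare` rather than `String.compareTo`. Whether the word lands inside the range depends on the collation rules your JVM ships for the chosen locale, so the sketch prints the result instead of assuming it.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatedRangeSketch {

    // Hypothetical helper (not a Lucene API): true when term lies in
    // [lower, upper], both ends inclusive, under the collator's ordering.
    static boolean inCollatedRange(Collator collator, String lower, String upper, String term) {
        return collator.compare(term, lower) >= 0
            && collator.compare(term, upper) <= 0;
    }

    public static void main(String[] args) {
        // Same locale as new Locale("ar") used in the thread.
        Collator arabic = Collator.getInstance(Locale.forLanguageTag("ar"));
        String dal = "\u062F";  // lower bound of the range, dal
        String zhe = "\u0698";  // upper bound of the range, zhe
        // The word from the original report; its first letter is seen (U+0633).
        String word = "\u0633\u0627\u0628 \u0648\u0648\u0641\u0631";
        System.out.println("in collated range: " + inCollatedRange(arabic, dal, zhe, word));
    }
}
```

The helper takes a Collator rather than a Locale, so a hand-built RuleBasedCollator can be passed in just as easily as a locale-derived one.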

Steve

On 05/10/2008 at 1:22 PM, esra wrote:
> 
> Hi Steve,
> 
> I used "ar" as the locale and it works fine.
> 
> Again, thanks a lot for your help.
> 
> Esra
> 
> 
> Steven A Rowe wrote:
> > 
> > Hi Esra,
> > 
> > On 05/06/2008 at 7:38 AM, esra wrote:
> > > i tried the class and it works fine with the locale parameter "ar".
> > 
> > Cool, I'm glad this addressed your problem!
> > 
> > > Actually we are using "fa" for farsi and "ar" for arabic.
> > > I have added a little control for the locale parameter in my
> > > code and now i can see the correct results.
> > 
> > From what I could tell, the Collator available for Locale("fa") in the
> > Sun 1.4.2 and 1.5.0 JDKs does not contain Farsi character collation,
> > but the Collator available for Locale("ar") *does* contain Farsi
> > collation.  I switched TestCollatingRangeQuery from Locale("fa") to
> > Locale("ar") when I couldn't get the Collator returned for Farsi [ via
> > Collator.getInstance(new Locale("fa")) ] to produce correct results.
> > 
> > Did you find that Locale("fa") produces the correct results?  If so,
> > which VM are you using?
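
One way to answer the "which VM" question empirically is to probe the JVM directly. The sketch below is an illustration only; the class and method names are made up for this example. It reports whether `Collator.getAvailableLocales()` lists a language and how that locale's Collator orders zhe (U+0698) against seen (U+0633). The output differs between JVM vendors and versions, so no expected result is shown.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorProbe {

    // True when Collator.getAvailableLocales() lists the given language code.
    static boolean listsLanguage(String language) {
        for (Locale locale : Collator.getAvailableLocales()) {
            if (locale.getLanguage().equals(language)) {
                return true;
            }
        }
        return false;
    }

    // Sign of comparing zhe (U+0698) with seen (U+0633) under the locale's
    // Collator; a negative value means zhe sorts before seen, as Farsi expects.
    static int zheVersusSeen(String languageTag) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(languageTag));
        return Integer.signum(collator.compare("\u0698", "\u0633"));
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        for (String tag : new String[] {"fa", "ar"}) {
            System.out.println(tag + ": listed=" + listsLanguage(tag)
                + ", zhe vs seen=" + zheVersusSeen(tag));
        }
    }
}
```

Posting the printed lines along with the VM version makes it easy to compare results across JDKs.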
> > 
> > At Chris Hostetter's suggestion, I am rewriting the patch attached to
> > LUCENE-1279, including the following changes:
> > 
> > - Merged the contents of the CollatingRangeQuery class into RangeQuery
> >   and RangeFilter
> > - Switched the Locale parameter to instead take an instance of Collator
> > - Modified QueryParser.jj to construct a QueryParser class that can
> >   accept a range collator and pass it either to RangeQuery or through
> >   ConstantScoreRangeQuery to RangeFilter.
> > 
> > I plan on posting the revised patch in the next day or two.
> > 
> > Steve
> > 
> > On 05/06/2008 at 7:38 AM, esra wrote:
> > > 
> > > Hi Steven,
> > > 
> > > i tried the class and it works fine with the locale parameter "ar".
> > > 
> > > Actually we are using "fa" for farsi and "ar" for arabic.
> > > I have added a little control for the locale parameter in my
> > > code and now i can see the correct results.
> > > 
> > > Thank you very much for your help.
> > > 
> > > Esra.
> > > 
> > > Steven A Rowe wrote:
> > > > 
> > > > Hi Esra,
> > > > 
> > > > I have attached a patch to LUCENE-1279 containing a new class:
> > > > CollatingRangeQuery.
> > > > 
> > > > The patch also contains a test class: TestCollatingRangeQuery.  One
> > > > of the test methods checks for the Farsi range you were having
> > > > trouble with.
> > > > 
> > > > It should be mentioned that according to
> > > > Collator.getAvailableLocales(), neither Java 1.4.2 nor Java 1.5.0
> > > > contains Farsi collation information. However, in the test class I
> > > > use the Arabic Locale, which seems to properly collate the non-Arabic
> > > > Farsi letter U+0698, and hopefully behaves well with other Farsi
> > > > letters.  If you find that this is not the case, you can look into
> > > > writing collation rules using RuleBasedCollator - you should be able
> > > > to directly specify the proper letter orderings for Farsi;
> > > > CollatingRangeQuery would also have to supply a constructor that
> > > > takes in a Collator instead of a Locale.
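
As a rough sketch of the RuleBasedCollator route, the code below builds collation rules from the 32 base letters of the Persian alphabet in dictionary order. It is deliberately simplified and is an assumption-laden illustration, not a complete Farsi collation: alef with madda, hamza forms, diacritics and Arabic variant letters are left out, and the class and method names are invented for this example.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class FarsiRulesSketch {

    // The 32 base letters of the Persian alphabet in dictionary order.
    // Simplifying assumption: alef with madda, hamza forms, diacritics and
    // Arabic variant letters are deliberately omitted.
    static final String FARSI_ORDER =
        "\u0627\u0628\u067E\u062A\u062B\u062C\u0686\u062D"   // alef be pe te se jim che he
      + "\u062E\u062F\u0630\u0631\u0632\u0698\u0633\u0634"   // khe dal zal re ze zhe seen shin
      + "\u0635\u0636\u0637\u0638\u0639\u063A\u0641\u0642"   // sad zad ta za eyn gheyn fe qaf
      + "\u06A9\u06AF\u0644\u0645\u0646\u0648\u0647\u06CC";  // kaf gaf lam mim nun vav he ye

    // Builds rules of the form "< a < b < c": one primary difference per letter.
    static RuleBasedCollator farsiCollator() throws ParseException {
        StringBuilder rules = new StringBuilder();
        for (int i = 0; i < FARSI_ORDER.length(); i++) {
            rules.append("< ").append(FARSI_ORDER.charAt(i)).append(' ');
        }
        return new RuleBasedCollator(rules.toString());
    }

    public static void main(String[] args) throws ParseException {
        RuleBasedCollator farsi = farsiCollator();
        // Under these rules zhe (U+0698) sorts before seen (U+0633) ...
        System.out.println(farsi.compare("\u0698", "\u0633") < 0);   // prints true
        // ... whereas String.compareTo puts zhe after seen.
        System.out.println("\u0698".compareTo("\u0633") > 0);        // prints true
    }
}
```

A collator built this way could be handed to any range check that accepts a Collator instead of a Locale.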
> > > > 
> > > > Please give the class a try and post back about how it works.
> > > > 
> > > > Thanks,
> > > > Steve
> > > > 
> > > > On 05/03/2008 at 8:33 AM, esra wrote:
> > > > > 
> > > > > Hi Steven,
> > > > > 
> > > > > thanks for your help....
> > > > > 
> > > > > Esra
> > > > > 
> > > > > 
> > > > > Steven A Rowe wrote:
> > > > > > 
> > > > > > Hi Esra,
> > > > > > 
> > > > > > I have created an issue for this - see
> > > > > > <https://issues.apache.org/jira/browse/LUCENE-1279>.
> > > > > > 
> > > > > > I'll try to take a crack at a patch this weekend.
> > > > > > 
> > > > > > Steve
> > > > > > 
> > > > > > On 05/02/2008 at 12:55 PM, esra wrote:
> > > > > > > 
> > > > > > > Hi Steven ,
> > > > > > > 
> > > > > > > yes you are right, sorry i am a bit confused.
> > > > > > > 
> > > > > > > i checked again and the correct one is  "zhe"/U+698.
> > > > > > > 
> > > > > > > It seems the word is in the range but my customer says it
> > > > > > > shouldn't be.
> > > > > > > 
> > > > > > > I think the problem occurs because "zhe" is a Persian letter outside
> > > > > > > the Arabic alphabet. In the Farsi alphabet this letter is not after
> > > > > > > the "س" letter, but its Unicode code point is larger than that of "س",
> > > > > > > and the searcher compares by code point.
> > > > > > > 
> > > > > > > Esra
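
The code point ordering described above can be reproduced with plain `String.compareTo`, which compares UTF-16 code unit values. This is only an illustration of the comparison, assuming (as the thread does) that the range check boils down to `compareTo`; `inCodeUnitRange` is a hypothetical helper, not a Lucene method.

```java
public class CodePointOrderSketch {

    // True when term lies in [lower, upper] under String.compareTo,
    // which orders strings by UTF-16 code unit value.
    static boolean inCodeUnitRange(String lower, String upper, String term) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String dal = "\u062F";
        String ze = "\u0632";
        String seen = "\u0633";
        String zhe = "\u0698";
        // seen falls inside [dal, zhe] because U+062F <= U+0633 <= U+0698.
        System.out.println(inCodeUnitRange(dal, zhe, seen));  // prints true
        // With ze (U+0632) as the upper bound, seen is outside the range.
        System.out.println(inCodeUnitRange(dal, ze, seen));   // prints false
    }
}
```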
> > > > > > > 
> > > > > > > 
> > > > > > > Steven A Rowe wrote:
> > > > > > > > 
> > > > > > > > Hi Esra,
> > > > > > > > 
> > > > > > > > You are *still* incorrectly referring to the glyph with three dots
> > > > > > > > over it:
> > > > > > > > 
> > > > > > > > On 05/02/2008 at 12:18 PM, esra wrote:
> > > > > > > > > yes the correct one is "ژ "/"ze"/U+632.
> > > > > > > > 
> > > > > > > > "ژ" is *not* "ze"/U+632 - it is "zhe"/U+698.
> > > > > > > > 
> > > > > > > > Have you increased the font size?  Can you see the difference
> > > > > > > > between these two?:
> > > > > > > > 
> > > > > > > > "ژ"/"zhe"/U+698
> > > > > > > > "ز"/"ze"/U+632
> > > > > > > > 
> > > > > > > > > my problem is when i do search for the "د-ژ" range. The result is
> > > > > > > > > "ساب ووفر" and this word's first letter is "س" and its Unicode
> > > > > > > > > is "U+633" and it is not in the [ U+062F - U+0632 ] range.
> > > > > > > > 
> > > > > > > > Like I keep saying, in the above description, you're using the glyph
> > > > > > > > "ژ"/"zhe"/U+698, while at the same time incorrectly referring to it
> > > > > > > > as "ze"/U+632.
> > > > > > > > 
> > > > > > > > I don't mean to continually bang on about this - if you're *sure*
> > > > > > > > that when you search, you're using the character represented by the
> > > > > > > > glyph with one dot (and not three), i.e. "ز"/"ze"/U+632, then the
> > > > > > > > problem lies elsewhere.
> > > > > > > > 
> > > > > > > > Steve
> > > > > > > > 
> > > > > > > > On 05/02/2008 at 12:18 PM, esra wrote:
> > > > > > > > > yes the correct one is "ژ "/"ze"/U+632.
> > > > > > > > > 
> > > > > > > > > my problem is when i do search for the "د-ژ" range. The result is
> > > > > > > > > "ساب ووفر" and this word's first letter is "س" and its Unicode
> > > > > > > > > is "U+633" and it is not in the [ U+062F - U+0632 ] range.
> > > > > > > > > am i wrong?
> > > > > > > > > 
> > > > > > > > > Esra
> > > > > > > > > 
> > > > > > > > > Steven A Rowe wrote:
> > > > > > > > > > 
> > > > > > > > > > Hi Esra,
> > > > > > > > > > 
> > > > > > > > > > I still think you're wrong :).
> > > > > > > > > > 
> > > > > > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
> > > > > > > > > > > > ژ = U+632
> > > > > > > > > > 
> > > > > > > > > > According to the website you linked to, the above character, which
> > > > > > > > > > has three dots over it, is named "zhe", and its Unicode code point is
> > > > > > > > > > U+698. (I had to increase the font size to see the three dots.)
> > > > > > > > > > 
> > > > > > > > > > I think you are confusing "ژ"/"zhe"/U+698 with the letter
> > > > > > > > > > "ز"/"ze"/U+632, which has just one dot over it.
> > > > > > > > > > 
> > > > > > > > > > Unless you were mistaken in all of your emails when you included the
> > > > > > > > > > character "ژ"/"zhe" instead of "ز"/"ze", then what I said in my
> > > > > > > > > > previous email still stands: there is no problem here.
> > > > > > > > > > 
> > > > > > > > > > Steve
> > > > > > > > > > 
> > > > > > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
> > > > > > > > > > > 
> > > > > > > > > > > Hi Steven,
> > > > > > > > > > > 
> > > > > > > > > > > sorry i made a mistake. unicodes are like this:
> > > > > > > > > > > 
> > > > > > > > > > > > د=U+62F
> > > > > > > > > > > > ژ = U+632
> > > > > > > > > > > > and the first letter of "ساب ووفر " is  س = U+633
> > > > > > > > > > > 
> > > > > > > > > > > you can also check them here
> > > > > > > > > > > > http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html
> > > > > > > > > > > 
> > > > > > > > > > > Esra
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Steven A Rowe wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Esra,
> > > > > > > > > > > > 
> > > > > > > > > > > > Going back to the original problem statement, I see something that
> > > > > > > > > > > > looks illogical to me - please correct me if I'm wrong:
> > > > > > > > > > > > 
> > > > > > > > > > > > On Apr 30, 2008, at 3:21 AM, esra wrote:
> > > > > > > > > > > > > i am using lucene's "IndexSearcher" to search the given xml by
> > > > > > > > > > > > > keyword which contains farsi information. while searching i use
> > > > > > > > > > > > > ranges like
> > > > > > > > > > > > > 
> > > > > > > > > > > > > آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  | ک-ل  |  م-ی
> > > > > > > > > > > > > 
> > > > > > > > > > > > > when i do search for the "د-ژ" range the results are wrong, they
> > > > > > > > > > > > > are the results of the "س-ظ" range.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > for example when i do search for "د-ژ" one of the results
> > > > > > > > > > > > > is "ساب ووفر", this result is also shown on the "س-ظ" range's
> > > > > > > > > > > > > result list, which is the correct range.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > As IndexSearcher uses the "compareTo" method and this method uses
> > > > > > > > > > > > > unicodes for comparing, i found the unicodes of the characters.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > د=U+62F
> > > > > > > > > > > > > ژ = U+698
> > > > > > > > > > > > > and the first letter of "ساب ووفر " is  س = U+633
> > > > > > > > > > > > 
> > > > > > > > > > > > It appears to me that *both* the "د-ژ" range [ U+062F - U+0698 ] and
> > > > > > > > > > > > the "س-ظ" range [ U+0633 - U+0638 ] contain the first letter of
> > > > > > > > > > > > "ساب ووفر", which is "س" = U+0633.
> > > > > > > > > > > > 
> > > > > > > > > > > > You stated that U+0633 should be contained in the [ U+0633 - U+0638 ]
> > > > > > > > > > > > range - I agree - but why do you think U+0633 should not be
> > > > > > > > > > > > contained in the [ U+062F - U+0698 ] range?
> > > > > > > > > > > > 
> > > > > > > > > > > > In other words, it looks to me like your problem is not a problem at
> > > > > > > > > > > > all.
> > > > > > > > > > > > 
> > > > > > > > > > > > Steve
 
 --
 View this message in context: http://www.nabble.com/lucene-farsi-problem-tp16977096p17165550.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 
 ---------------------------------------------------------------------
 To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
 For additional commands, e-mail: java-user-help@lucene.apache.org

 
