lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion
Date Sat, 13 Sep 2008 17:46:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630806#action_12630806 ]

Steven Rowe commented on LUCENE-1279:
-------------------------------------

{quote}
from the Collator javadocs:
bq. When sorting a list of Strings however, it is generally necessary to compare each String
multiple times. In this case, CollationKeys provide better performance. The CollationKey class
converts a String to a series of bits that can be compared bitwise against other CollationKeys.
A CollationKey is created by a Collator object for a given String. 

I don't think we need to implement this now, but I wonder if there is a performance difference
if we created the CollationKey for comparison. The big question is whether the construction
of that for each term outweighs the savings by repeated comparisons to lower and upper.
{quote}
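
A minimal sketch of the two options weighed in the quoted comment, using only java.text.Collator and java.text.CollationKey. The class and method names are hypothetical and are not taken from the attached patches; Locale.ENGLISH is used only to keep the example self-contained.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollatedRangeCheck {

    // Option 1: two Collator.compare() calls per term; no key object is built.
    static boolean inRangeWithCompare(Collator collator, String term,
                                      String lower, String upper) {
        return collator.compare(term, lower) >= 0
            && collator.compare(term, upper) <= 0;
    }

    // Option 2: the endpoint keys are built once per query, but a new
    // CollationKey still has to be built for every term that is tested.
    static boolean inRangeWithKeys(Collator collator, String term,
                                   CollationKey lowerKey, CollationKey upperKey) {
        CollationKey termKey = collator.getCollationKey(term);
        return termKey.compareTo(lowerKey) >= 0
            && termKey.compareTo(upperKey) <= 0;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        CollationKey lowerKey = collator.getCollationKey("apple");
        CollationKey upperKey = collator.getCollationKey("cherry");

        for (String term : new String[] {"Banana", "date"}) {
            boolean viaCompare = inRangeWithCompare(collator, term, "apple", "cherry");
            boolean viaKeys = inRangeWithKeys(collator, term, lowerKey, upperKey);
            System.out.println(term + ": compare=" + viaCompare + ", keys=" + viaKeys);
        }
    }
}
```

Both methods give the same answer for any term; they differ only in where the work is done. Which one is faster over a whole term enumeration would have to be measured.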

I think the problem is that every single index term has to be converted to a CollationKey
for every single (range) search.  In an earlier comment on this issue, Hoss said:

bq. 4) when i first saw the thread that spawned this issue, my first reaction was to wonder
if it would make sense to start allowing a Collator to be specified when indexing, and to
use the raw bytes from the CollationKey as the indexed value - I haven't thought it through
very hard, but i wonder if that would be feasible (it seems like it would certainly be faster
at query time, since it would allow more traditional term skipping).

I'm working on a utility class to store arbitrary binary data in sortable, indexable Strings, so
that CollationKeys can be stored in the index.  IMHO, though, this issue should still go forward.
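
To illustrate the idea only (this is not the utility class mentioned above, and the names are hypothetical): a deliberately naive encoding that maps each byte of CollationKey.toByteArray() to one char in the range 0..255. Since the key bytes are meant to be compared as unsigned values, most significant byte first, String.compareTo() on the encoded form should order terms the same way the Collator does. It wastes space, and whether such Strings are acceptable as index terms is a separate question that this sketch does not address.

```java
import java.text.Collator;
import java.util.Locale;

public class SortableKeyString {

    // Naive, hypothetical encoding: one char (0..255) per key byte, so that
    // char-by-char String comparison mirrors unsigned byte-by-byte comparison.
    static String encode(byte[] keyBytes) {
        StringBuilder sb = new StringBuilder(keyBytes.length);
        for (byte b : keyBytes) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    // The value that would be indexed in place of the raw term text.
    static String indexedForm(Collator collator, String term) {
        return encode(collator.getCollationKey(term).toByteArray());
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        String a = indexedForm(collator, "apple");
        String b = indexedForm(collator, "Banana");

        // Raw text: "Banana" sorts before "apple" by code unit.
        System.out.println("raw:     " + Integer.signum("apple".compareTo("Banana")));
        // Encoded keys: "apple" sorts before "Banana", as the Collator says.
        System.out.println("encoded: " + Integer.signum(a.compareTo(b)));
    }
}
```

With values like these in the index, a range search would only need to encode the lower and upper bounds once and could then compare terms as plain Strings.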

bq. One more question, and it probably shows my lack of knowledge here, but would it be possible
to enumerate the various codepoints where there are problems and just handle these separately,
somehow? Basically, how pervasive is the problem? Would we perhaps be better off having a
check to see if one of these bad codepoints falls in the range of lower/upper and then handle
it separately?

Different languages, in some cases using the same character repertoire, define different
orderings.  Also, I believe some orderings are context-dependent: you can't always compare
character by character.  So adding this to Lucene would duplicate much of what the Collator
already does.
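
A small illustration, with hypothetical class and method names: even for plain ASCII English words, code-unit order and Collator order already disagree. The example only shows a case-related difference; it does not demonstrate the context-dependent orderings mentioned above.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class OrderingComparison {

    // Sorts a copy by String.compareTo(), i.e. by UTF-16 code unit value.
    static String[] sortedByCodeUnit(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy using the collation rules of the given locale.
    static String[] sortedByCollator(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"cherry", "apple", "Banana"};
        // prints [Banana, apple, cherry]
        System.out.println(Arrays.toString(sortedByCodeUnit(words)));
        // prints [apple, Banana, cherry]
        System.out.println(Arrays.toString(sortedByCollator(words, Locale.ENGLISH)));
    }
}
```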

> RangeQuery and RangeFilter should use collation to check for range inclusion
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-1279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1279
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.1
>            Reporter: Steven Rowe
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1279.patch, LUCENE-1279.patch, LUCENE-1279.patch, LUCENE-1279.patch
>
>
> See [this java-user discussion|http://www.nabble.com/lucene-farsi-problem-td16977096.html]
> of problems caused by Unicode code-point comparison, instead of collation, in RangeQuery.
> RangeQuery could take in a Locale via a setter, which could be used with a java.text.Collator
> and/or CollationKeys, to handle ranges for languages which have alphabet orderings different
> from those in Unicode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

