lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
Date Tue, 11 Nov 2008 21:19:46 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646667#action_12646667 ]

Michael McCandless commented on LUCENE-1435:
--------------------------------------------


bq. Are you suggesting to not store collation keys in the index?

Right, I'm proposing storing the original Strings, but sorted
according to Collator.compare (for that one field), in the Terms dict.
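
As a rough illustration of the difference in ordering (not Lucene code; the
class and method names below are made up for this sketch), here is how the same
unmodified Strings sort under String.compareTo versus Collator.compare, using
only java.text.Collator:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of Lucene.
public class CollatedTermOrder {

    // Sort by String.compareTo, i.e. UTF-16 code unit order.
    static List<String> sortByCodeUnit(List<String> terms) {
        List<String> copy = new ArrayList<>(terms);
        copy.sort(Comparator.<String>naturalOrder());
        return copy;
    }

    // Sort the same original Strings by Collator.compare for the given locale.
    // Collator implements Comparator<Object>, so it can be passed directly.
    static List<String> sortByCollator(List<String> terms, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> copy = new ArrayList<>(terms);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("cherry", "Banana", "apple");
        System.out.println(sortByCodeUnit(terms));                 // uppercase sorts first
        System.out.println(sortByCollator(terms, Locale.ENGLISH)); // dictionary-style order
    }
}
```

In both cases the values being sorted are the original Strings; only the
comparator changes.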

bq. The query-time process in this patch is not the reverse - it is exactly the same.

OK got it.  Where/how would you implement the query time conversion of
terms?

And wouldn't there be times when you also want to reverse the
encoding?  E.g., if you enumerate all terms for presentation (maybe as
part of faceted search)?

bq. In the current code base, for range searching on a collated field, every single term has
to be collated with the search term. This patch allows skipTo to function when using collation,
potentially providing a significant speedup.

Both the original proposed approach (external-to-indexing) and this
internal-to-indexing approach would solve this, right?  Ie, in both
cases the terms have been sorted according to the Collator, but in the
internal-to-indexing case it's the original String term stored in the
terms dict.
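
A minimal sketch of why collator-sorted terms help range searching, assuming an
in-memory sorted list stands in for the terms dict (the class below is
hypothetical and does not use Lucene's TermEnum or skipTo): once the terms are
in Collator order, a binary search can seek to the lower bound and the scan can
stop at the first term past the upper bound, instead of collating every term.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a terms dict; not Lucene's API.
public class CollatedRangeScan {

    // Returns all terms t with lower <= t <= upper under the collator.
    // Assumes sortedTerms is already sorted by the same collator and that
    // no two terms compare as equal under it.
    static List<String> range(List<String> sortedTerms, String lower,
                              String upper, Collator collator) {
        // Seek: binary search returns -(insertionPoint) - 1 when lower is absent.
        int start = Collections.binarySearch(sortedTerms, lower, collator);
        if (start < 0) {
            start = -start - 1;
        }
        List<String> hits = new ArrayList<>();
        for (int i = start; i < sortedTerms.size(); i++) {
            String term = sortedTerms.get(i);
            if (collator.compare(term, upper) > 0) {
                break; // sorted order means nothing later can match
            }
            hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        List<String> terms = new ArrayList<>(
                Arrays.asList("date", "cherry", "Banana", "apple"));
        terms.sort(collator);
        System.out.println(range(terms, "b", "d", collator));
    }
}
```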

Here are some pros of internal-to-indexing:

  - You don't have to convert every single term visited during
    analysis first to a CollationKey then ByteBuffer then encoded
    binary string.  Indexing throughput should be faster?  (Though,
    when writing the segment you do need to sort using
    Collator.compare, which I guess could be slow).

  - Real terms are stored in the index -- tools like Luke can look at
    the index and see normal looking terms.  Though... I don't have a
    sense of what the encoded term would look like -- maybe it's not
    that different from the original in practice?

  - Querying would just work without term conversion
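
For reference, the per-term conversion mentioned in the first point can be
sketched as follows. This covers only the String to CollationKey to byte[] step
using java.text; the final IndexableBinaryStringTools encoding from the patch
is not reproduced here, and the class name is made up. The unsigned byte
comparison is written out to show that the key bytes order the same way as
Collator.compare, as the CollationKey.toByteArray javadoc describes.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical sketch of the per-term work; not the patch's CollationKeyFilter.
public class CollationKeyBytes {

    // String -> CollationKey -> byte[] (most significant byte first).
    static byte[] keyBytes(Collator collator, String term) {
        return collator.getCollationKey(term).toByteArray();
    }

    // Lexicographic comparison treating each byte as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        int byKey = compareUnsigned(keyBytes(collator, "apple"),
                                    keyBytes(collator, "Banana"));
        int byCollator = collator.compare("apple", "Banana");
        // Both comparisons should agree in sign.
        System.out.println(Integer.signum(byKey) == Integer.signum(byCollator));
    }
}
```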

And some cons:

  - It's obviously a more invasive change to Lucene (and probably
    should go after the flex-indexing changes).  The
    external-to-indexing approach is nicely externalized.

  - Performance -- the binary search of the terms index would be
    slower using Collator.compare instead of String.compareTo (though
    I would expect the slowdown to be minimal in practice).

I'm sure there are many pros/cons I'm missing...


> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch, LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes
> the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation
> for proper ordering.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

