lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
Date Tue, 11 Nov 2008 21:53:45 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646679#action_12646679 ]

Steven Rowe commented on LUCENE-1435:
-------------------------------------

bq. And wouldn't there be times when you also want to reverse the encoding? E.g., if you enumerate
all terms for presentation (maybe as part of faceted search, for example)?

AFAIK, CollationKey generation is a one-way operation.  If the original terms are required
for presentation, they can be stored, right?
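One reason a key cannot be turned back into its term in general: distinct strings can produce equal keys. A minimal sketch using the JDK's java.text.Collator, not Lucene code; the class name, the English locale, and the PRIMARY strength setting are assumptions chosen for illustration:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of the LUCENE-1435 patch.
class CollationKeyOneWaySketch {

    // Assumption: an English collator at PRIMARY strength, which ignores case.
    static Collator primaryCollator() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    // True when both strings collapse to keys that compare as equal.
    static boolean sameKey(Collator collator, String a, String b) {
        CollationKey ka = collator.getCollationKey(a);
        CollationKey kb = collator.getCollationKey(b);
        return ka.compareTo(kb) == 0;
    }

    public static void main(String[] args) {
        Collator collator = primaryCollator();

        // "abc" and "ABC" differ only in case, so at PRIMARY strength their
        // keys compare equal: the key alone cannot say which term produced it.
        System.out.println(sameKey(collator, "abc", "ABC"));

        // The in-memory CollationKey object remembers its source string, but
        // the byte array (the part that would be indexed) does not carry it.
        CollationKey key = collator.getCollationKey("ABC");
        System.out.println(key.getSourceString() + " -> "
                + key.toByteArray().length + " key bytes");
    }
}
```

If the original text is needed for display, it has to be kept alongside the key, for example in a stored field as suggested above.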

{quote}
Here are some pros of internal-to-indexing:
      [...]
    - Real terms are stored in the index - tools like Luke can look at
      the index and see normal looking terms. Though... I don't have a
      sense of what the encoded term would look like - maybe it's not
      that different from the original in practice?
{quote}

IndexableBinaryStringTools (LUCENE-1434) implements a base-8000h encoding: the lower 15 bits
of each character hold 1 7/8 bytes (15 bits) of the input.  The result looks radically different from the original
byte array, at least when viewed in a tool like Luke.  And I don't
think CollationKeys themselves are intended for human consumption.
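To give a feel for what "15 bits per character" means, here is a simplified sketch of that kind of packing. It is not the LUCENE-1434 implementation: the class and method names are hypothetical, and the real encoder may differ in bit layout and in how it marks the length of the final, partially filled character.

```java
// Hypothetical illustration of packing bytes into 15-bit chars.
class Base8000hSketch {

    // Packs the input bits, most significant first, into chars whose
    // values stay in the range 0x0000..0x7FFF (15 payload bits each).
    static String encode(byte[] input) {
        StringBuilder out = new StringBuilder();
        int buffer = 0;   // bits read from the input but not yet emitted
        int bitCount = 0; // number of pending bits held in buffer
        for (byte b : input) {
            buffer = (buffer << 8) | (b & 0xFF);
            bitCount += 8;
            if (bitCount >= 15) {
                bitCount -= 15;
                // Emit the 15 oldest pending bits as one char.
                out.append((char) ((buffer >>> bitCount) & 0x7FFF));
                // Keep only the bits that were not emitted.
                buffer &= (1 << bitCount) - 1;
            }
        }
        if (bitCount > 0) {
            // Left-align the leftover bits and pad with zeros.
            out.append((char) ((buffer << (15 - bitCount)) & 0x7FFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Three input bytes (24 bits) need two 15-bit chars.
        String encoded = encode(new byte[] {0x41, 0x42, 0x43});
        for (char c : encoded.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The bytes 0x41 0x42 0x43 ("ABC" in ASCII) come out as two characters that bear no visible relation to the input, which is why such terms would not look like normal text in an index browser.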

{quote}
bq. In the current code base, for range searching on a collated field, every single term has
to be collated with the search term. This patch allows skipTo to function when using collation,
potentially providing a significant speedup.

Both the original proposed approach (external-to-indexing) and this
internal-to-indexing approach would solve this, right? Ie, in both
cases the terms have been sorted according to the Collator, but in the
internal-to-indexing case it's the original String term stored in the
terms dict.
{quote}

Perhaps I'm missing something, but o.a.l.index.TermEnum.skipTo(Term) compares the target term
using String.compareTo(), so regardless of the index term dictionary ordering, skipTo() won't
necessarily stop at the correct location, right?  From TermEnum.java:

{code:java}
  public boolean skipTo(Term target) throws IOException {
    do {
      if (!next())
        return false;
    } while (target.compareTo(term()) > 0);
    return true;
  }
{code}

and here's o.a.l.index.Term.compareTo(Term):

{code:java}
  public final int compareTo(Term other) {
    if (field == other.field)  // fields are interned
      return text.compareTo(other.text);
    else
      return field.compareTo(other.field);
  }
{code}
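The concern can be reproduced without Lucene. The sketch below is an illustration under stated assumptions, not Lucene code: the class and helper names are hypothetical, the "term dictionary" is a plain list of strings sorted with an English java.text.Collator, and the skipTo helper only mirrors the shape of the loop quoted above.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; mimics the quoted loop over plain strings.
class SkipToOrderingSketch {

    // Advance until target.compareTo(current) <= 0, using String.compareTo,
    // as the quoted TermEnum.skipTo / Term.compareTo pair does.
    static String skipTo(Iterator<String> terms, String target) {
        while (terms.hasNext()) {
            String current = terms.next();
            if (target.compareTo(current) <= 0) {
                return current;
            }
        }
        return null; // ran off the end of the enumeration
    }

    // Assumption: terms laid out in collation order rather than UTF-16 order.
    static List<String> collatorSorted(String... terms) {
        List<String> sorted = new ArrayList<>(Arrays.asList(terms));
        sorted.sort(Collator.getInstance(Locale.ENGLISH));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> terms = collatorSorted("cherry", "Banana", "apple");
        System.out.println("collation order: " + terms);

        // "Banana" sorts before "apple" under String.compareTo ('B' < 'a'),
        // so the loop stops at the first term instead of at "Banana".
        System.out.println("skipTo(\"Banana\") stops at: "
                + skipTo(terms.iterator(), "Banana"));
    }
}
```

With these three terms the collated order is apple, Banana, cherry, and the String.compareTo-based loop stops at "apple" when asked to skip to "Banana", because the comparison it uses disagrees with the order the terms are stored in.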


> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch, LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes
> the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation
> for proper ordering.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


