lucene-solr-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (SOLR-1321) Support for efficient leading wildcards search
Date Fri, 31 Jul 2009 17:13:15 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737601#action_12737601 ]

Robert Muir edited comment on SOLR-1321 at 7/31/09 10:11 AM:
-------------------------------------------------------------

Andrzej, I see what you are saying. I think it's a great feature the way it is!

{noformat}
In the future I will take a look at finding a way to do both; this way complex cases like
*abcde?f get reversed by this feature into \u0001f?edcba*,
but are then implemented with an automaton so that it doesn't have to enumerate all tokens
that start with \u0001f.
{noformat}

This is a bad example, but I hope you see what I mean. The biggest challenge would be preventing
suboptimal cases, like reversing g?abcde* into \u0001*edcba?g (at least I think).
The first is actually more efficient, I think, regardless of the wildcard impl.

I wonder if your patch could have an additional check: if a wildcard is in the 1st position
but the last character is also a wildcard, do not reverse the term?
In the example above, even with the default Lucene wildcard query, it would at least only enumerate
the tokens starting with g, so it's better not to reverse it.

If the wildcard is in the 0th position it doesn't matter whether you reverse it or not, but I think
that one case (wildcard in the 1st position) can be optimized.
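
A minimal sketch of the check suggested above, assuming only * and ? count as wildcards and
ignoring escaping. The class and method names (ReverseDecision, shouldReverse) are hypothetical
and are not taken from wildcards.patch; the first rule restates the 0th/1st position condition
from the issue description, the second is the proposed extra check.

```java
// Hypothetical sketch, not the code from wildcards.patch.
final class ReverseDecision {

    // Assumption: only '*' and '?' are wildcard characters; escapes are ignored.
    static boolean isWildcard(char c) {
        return c == '*' || c == '?';
    }

    // Decide whether a wildcard term should be rewritten against the reversed tokens.
    static boolean shouldReverse(String term) {
        if (term.length() < 2) {
            return false;
        }
        boolean at0 = isWildcard(term.charAt(0));
        boolean at1 = isWildcard(term.charAt(1));
        boolean atEnd = isWildcard(term.charAt(term.length() - 1));

        // Rule from the issue description: reverse only if a wildcard
        // sits at position 0 or position 1.
        if (!at0 && !at1) {
            return false;
        }
        // Suggested extra check: wildcard at position 1 (not 0) and a trailing
        // wildcard. The unreversed query already has a one-character prefix,
        // while the reversed form would start with a wildcard.
        if (!at0 && at1 && atEnd) {
            return false;
        }
        // Wildcard at position 0: keep the behaviour of reversing.
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {"*abcde?f", "g?abcde*", "g?abcde", "abc*"};
        for (String s : samples) {
            System.out.println(s + " -> reverse? " + shouldReverse(s));
        }
    }
}
```

With this rule g?abcde* stays as it is, while *abcde?f and g?abcde are still reversed.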

Thanks,
Robert

> Support for efficient leading wildcards search
> ----------------------------------------------
>
>                 Key: SOLR-1321
>                 URL: https://issues.apache.org/jira/browse/SOLR-1321
>             Project: Solr
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Andrzej Bialecki 
>             Fix For: 1.4
>
>         Attachments: wildcards.patch
>
>
> This patch is an implementation of the "reversed tokens" strategy for efficient leading
> wildcards queries.
> ReversedWildcardsTokenFilter reverses tokens and returns both the original token (optional)
> and the reversed token (with positionIncrement == 0). Reversed tokens are prepended with a
> marker character to avoid collisions between legitimate tokens and the reversed tokens - e.g.
> "DNA" would become "and", thus colliding with the regular term "and", but with the marker
> character it becomes "\u0001and".
> This TokenFilter can be added to the analyzer chain that is used during indexing.
> SolrQueryParser has been modified to detect the presence of such fields in the current
> schema, and treat them in a special way. First, SolrQueryParser examines the schema and collects
> a map of fields where these reversed tokens are indexed. If there is at least one such field,
> it also sets QueryParser.setAllowLeadingWildcards(true). When building a wildcard query (in
> getWildcardQuery) the term text may be optionally reversed to put wildcards further along
> the term text. This happens when the field uses the reversing filter during indexing (as detected
> above), AND if the wildcard characters are either at 0-th or 1-st position in the term. Otherwise
> the term text is processed as before, i.e. turned into a regular wildcard query.
> Unit tests are provided to test the TokenFilter and the query parsing.
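
The marker-plus-reversal step in the description can be sketched roughly as follows. This is an
illustration only, not the code from wildcards.patch: the class and method names are hypothetical,
and the input is assumed to be already lowercased by an earlier filter in the chain (which is how
"DNA" ends up as "and"). The same transformation is applied to a token at index time and to a
wildcard term at query time.

```java
// Hypothetical sketch of the "reversed tokens" transformation; names are illustrative.
final class ReversedTermSketch {

    // Marker character named in the issue description.
    static final char MARKER = '\u0001';

    // Reverse the text and prepend the marker so that the reversed form
    // cannot collide with a regular term.
    static String reverseWithMarker(String text) {
        return MARKER + new StringBuilder(text).reverse().toString();
    }

    public static void main(String[] args) {
        // Index time: the (lowercased) token "dna" is stored as marker + "and".
        String indexed = reverseWithMarker("dna");
        // Query time: the leading-wildcard term becomes a trailing-wildcard term.
        String query = reverseWithMarker("*abcde?f");
        // Show the control character in a printable form.
        System.out.println(indexed.replace(String.valueOf(MARKER), "<U+0001>"));
        System.out.println(query.replace(String.valueOf(MARKER), "<U+0001>"));
    }
}
```

Because the marker is prepended after reversal, the rewritten query has a fixed prefix (marker
plus "f" in the example), so the wildcard is no longer in the leading position.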

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

