lucene-java-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: SpanRegex speed
Date Thu, 31 Aug 2006 02:05:09 GMT
Erik Hatcher wrote:
>
> On Aug 30, 2006, at 6:13 PM, Mark Miller wrote:
>> * An implementation tying Java's built-in java.util.regex to RegexQuery.
>> *
>> * Note that because this implementation currently only returns null from
>> * {@link #prefix} that queries using this implementation will 
>> enumerate and
>> * attempt to {@link #match} each term for the specified field in the 
>> index.
>>
>> Is this another way to say I'm gonna be friggin' slow? Say it ain't so...
>
> "slow" is relative.  It will enumerate all the terms for the specified 
> field and run a regular expression match on each one.  The same thing 
> happens with FuzzyQuery and prefixed WildcardQuery too.  These aren't 
> necessarily "slow", so try it and see.
>
>> I want to use this as a multi-phrase query...a spannear with a term 
>> that could be the regex "term1|term2"
>
> What about nesting a SpanOrQuery for those two terms inside a 
> SpanNearQuery?
>
>> I need this. Pipe dream for speed on a huge index?
>
> Feel free to implement a robust prefix method :)  It's much more 
> difficult than I wanted to tackle when I created this infrastructure.  
> But thankfully Regexp implemented it, so you could use it for prefix 
> computation and a different matcher implementation if you like.
>
>     Erik
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
Thanks for the info Erik. I did not realize that WildcardQuery and 
FuzzyQuery did this as well. A lot of my concern was that I needed to 
implement WildcardQuery as a SpanRegexQuery so that I could get nested 
wildcard searches in my proximity searches. If it's the same speed as 
WildcardQuery I am not worried. However, it seems like it could be even 
faster:

I only need to support * and ?, as WildcardQuery does. I don't want to 
include Jakarta Regexp with my distro. I made a new regex implementation 
based on the Java 5 java.util.regex classes that only allows * and ?.

I pass the pattern string into a short method that removes single 
backslashes, halves double backslashes, escapes non-alphanumeric 
characters, and records the prefix. It ignores * and ?.

Then I replace * with .* and ? with .{1}.

Only supporting * and ? seems to make grabbing the prefix nice and simple.
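
To make the steps above concrete, here is a rough sketch of that kind of 
conversion, not my actual code. The class name WildcardToRegex, the 
Result holder, and the convert method are all made up for illustration. 
It assumes a backslash makes the next character literal (so \* is a 
literal star and \\ is a literal backslash), that a trailing lone 
backslash is dropped, and that only ASCII non-alphanumerics need 
escaping. It emits a plain . for ?, which is equivalent to .{1}.

```java
/**
 * Sketch only: translates a wildcard pattern that supports just * and ?
 * into a java.util.regex source string, and records the literal prefix
 * (everything before the first unescaped wildcard).
 * All names here are hypothetical, not part of Lucene.
 */
final class WildcardToRegex {

    /** Holds the generated regex source and the literal prefix. */
    static final class Result {
        public final String regex;
        public final String prefix;

        Result(String regex, String prefix) {
            this.regex = regex;
            this.prefix = prefix;
        }
    }

    static Result convert(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder prefix = new StringBuilder();
        boolean inPrefix = true; // true until the first unescaped * or ?

        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);

            if (c == '*') {            // any run of characters
                regex.append(".*");
                inPrefix = false;
                continue;
            }
            if (c == '?') {            // exactly one character
                regex.append('.');
                inPrefix = false;
                continue;
            }
            if (c == '\\') {
                // Assumption: a backslash makes the next character literal.
                if (i + 1 >= wildcard.length()) {
                    break;             // trailing lone backslash: dropped
                }
                c = wildcard.charAt(++i);
            }

            // Literal character: escape ASCII non-alphanumerics so that
            // regex metacharacters such as . or ( lose their meaning.
            if (c < 128 && !Character.isLetterOrDigit(c)) {
                regex.append('\\');
            }
            regex.append(c);
            if (inPrefix) {
                prefix.append(c);
            }
        }
        return new Result(regex.toString(), prefix.toString());
    }
}
```

With this sketch, convert("foo*b?r") yields the regex foo.*b.r and the 
prefix foo. The prefix is what a prefix method could hand back so that 
term enumeration starts at foo instead of at the first term in the field.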

Now my question: should I use this instead of WildcardQuery even when 
not in a span search? Sounds like it would be more efficient.

Also, how does a SpanOrQuery work? Is the resulting span anchored at the 
start of the word and the length of the word, like a term span? So that 
it's an "or" term span? If there is more than one match, does the span 
cover all of them, or is each match a span the size of each hit?

Thanks,

Mark
