lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Test code for regex queries
Date Thu, 24 Nov 2005 16:57:06 GMT
On Thursday 24 November 2005 10:25, Erik Hatcher wrote:
> 
> On 24 Nov 2005, at 03:17, Paul Elschot wrote:
> >> I must admit that I haven't used the surround parser.  For my custom
> >> parser (a legacy syntax that no one here would want), I take any term
> >> that has an *, ?, or [...] syntax as a regex term.
> >
> > I had another look at the javadocs of java regex package.
> > The normal brackets in a regex are not needed for queries, so they
> > can be left as they are.
> 
> I don't understand what you mean about brackets not being needed.   
> Brackets certainly factor in greatly into regular expressions:
> 
> 	<http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html>
> 
> What am I missing?

Capturing groups and special contexts need normal brackets ().
Capturing groups are used for replacements, and I don't see a use
for that in a query language.
Special constructs with () brackets are used for non capturing groups,
match flags, and lookahead/lookbehind.
Would you know a use for these in a query language?
 
> >> There are still some TODO's with the (Span)RegexQuery - such as being
> >> wise about the prefix length.  Right now it is not wise enough.  I've
> >> spent some time looking for a regex parser that could parse a regex
> >> expression into an AST so that it could be used for determining the
> >> last static character to start term enumeration.  This would also
> >> come in very handy in being able to rotate a regular expression
> >> string to maximize the static prefix when indexing with an analyzer
> >> that rotates terms.  If anyone has suggestions/pointers to how this
> >> could be accomplished, it'd be most appreciated!
> >
> > I think I'll simply treat each term as a potential regex and
> > use alphanumeric characters for the prefix. I'll try and leave
> > parsing of the regex to the java regex package as much as possible.
> 
> Simply using alphanumeric to determine the prefix doesn't work in all  
> cases though.  Here's an example I just added, commented out, to  
> TestRegexQuery:
> 
>   assertEquals(1, regexQueryNrHits("r?over"));
> 
> The 'r' is optional, yet it is chosen as a prefix and thus no  
> documents match even though "over" does match that regex.  The quick  

You're right, I missed the suffix meaning of ? .

> fix for this is, like FuzzyQuery, to enumerate all terms blindly.   
> Another option is to allow the user of (Span)RegexQuery to provide  
> the prefix, at least pushing the burden back a bit.
> 
> I still think a regex parser would be mighty useful for this issue,  
> as well as the next...
> 
> > Rotating from the suffix should also be straightforward for
> > alphanumerical chars.
> 
> It's not as trivial as this, if you want the rotated prefixes to be  
> maximized.  Take, for example, the regex expression
> 
> 	ThisContainsTwo[abc]RegexPieces.*Total
>
> Using rotation to maximize the prefix and speeding up term  
> enumeration a calculation of of the maximum non-regex piece is  
> needed, including a calculation on whether the head and tail combined  
> make a larger prefix.  For example, using '$' to denote the end of  
> the string, the rotated version of this should be:
> 
> 	Total$ThisContainsTwo[abc]RegexPieces.*
> 
> With a regex parse tree, it should be possible to be wise about what  
> is a static prefix and to compute the size of all the static pieces  
> allowing for clever rotation to make regex queries as efficient as  
> possible.  Now where is that regex parser grammar?!  :/

I also missed things like \u2014, which only add to the problem.

There are some older regex implementations in java, but I
have no idea about the licences and the availabiility.
Doesn't apache have one somewhere?

Without a regex parser a prefix regex operator like
PRE(prefix,regex)
in the surround parser would just do what is needed,
except for the comma and the brackets, and a regular expression
can work around these by using \xhh or \0nn .
Btw. $ also has a special meaning in regexes.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message