lucene-dev mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Test code for regex queries
Date Thu, 24 Nov 2005 08:17:12 GMT
On Thursday 24 November 2005 00:06, Erik Hatcher wrote:
> 
> On 23 Nov 2005, at 15:42, Paul Elschot wrote:
> > I refactored it to have a few more tests, and all seems to work well.
> > It also includes the tests from TestSpanRegexQuery.java .
> >
...
> 
> > To parse a regex query term, the surround parser will have to
> > be extended a bit so it recognizes a reasonable subset of the
> > java regular expressions.
> > Any preferences for the syntax for a regex term in the
> > surround parser?
> 
> I must admit that I haven't used the surround parser.  For my custom  
> parser (a legacy syntax that no one here would want), I take any term  
> that has an *, ?, or [...] syntax as a regex term.

I had another look at the javadocs of the Java regex package.
The normal brackets in a regex are not needed for queries, so they
can be left as they are.
All the rest could stay the same, except for the current surround truncation
characters * and ? (same as Lucene), for which the equivalent regexes are .* and .?
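
A minimal sketch of that mapping in Java, assuming a hypothetical helper
toRegex that is not part of the surround parser. It follows the .* and .?
equivalents given above. Note that Lucene's WildcardQuery documents ? as
matching exactly one character, so a plain . may be the stricter equivalent;
that choice is isolated in one constant here.

```java
import java.util.regex.Pattern;

public class TruncationToRegex {
    // Assumption: regex used for the old '?' truncation character.
    // The message suggests ".?"; "." would require exactly one character.
    static final String SINGLE = ".?";

    // Hypothetical helper: rewrites an old-style truncated term
    // (using * and ?) into a java.util.regex pattern string.
    static String toRegex(String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append(SINGLE);
            } else if (Character.isLetterOrDigit(c)) {
                sb.append(c); // letters and digits are literal in a regex
            } else {
                sb.append(Pattern.quote(String.valueOf(c))); // escape the rest
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String regex = toRegex("luc*n?");
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "lucene"));
    }
}
```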

> There are still some TODO's with the (Span)RegexQuery - such as being  
> wise about the prefix length.  Right now it is not wise enough.  I've  
> spent some time looking for a regex parser that could parse a regex  
> expression into an AST so that it could be used for determining the  
> last static character to start term enumeration.  This would also  
> come in very handy in being able to rotate a regular expression  
> string to maximize the static prefix when indexing with an analyzer  
> that rotates terms.  If anyone has suggestions/pointers to how this  
> could be accomplished, it'd be most appreciated!

I think I'll simply treat each term as a potential regex and
use alphanumeric characters for the prefix. I'll try and leave
parsing of the regex to the java regex package as much as possible.
Rotating from the suffix should also be straightforward for
alphanumerical chars.
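
As a rough illustration of the alphanumeric prefix idea, the Java sketch
below takes the leading run of letters and digits of a term as the static
prefix, and the trailing run as the static suffix that a rotated index could
use. The class and method names are hypothetical. It backs off one character
when the prefix is followed by a quantifier that makes that character
optional, and it skips a suffix character that is really part of an escape
such as \d. It does not handle alternation (|) or grouping, so it is only
safe for simple terms.

```java
public class StaticAffix {
    // Longest leading run of letters/digits that every match must start with.
    static String alnumPrefix(String term) {
        int i = 0;
        while (i < term.length() && Character.isLetterOrDigit(term.charAt(i))) {
            i++;
        }
        // A following *, ? or { can make the last character optional: drop it.
        if (i > 0 && i < term.length() && "*?{".indexOf(term.charAt(i)) >= 0) {
            i--;
        }
        return term.substring(0, i);
    }

    // Longest trailing run of letters/digits that every match must end with.
    static String alnumSuffix(String term) {
        int i = term.length();
        while (i > 0 && Character.isLetterOrDigit(term.charAt(i - 1))) {
            i--;
        }
        // If the run is preceded by a backslash, its first character belongs
        // to an escape sequence (for example \d), so it is not a literal.
        if (i > 0 && i < term.length() && term.charAt(i - 1) == '\\') {
            i++;
        }
        return term.substring(i);
    }

    public static void main(String[] args) {
        System.out.println(alnumPrefix("lucen.*"));
        System.out.println(alnumPrefix("ab*c"));
        System.out.println(alnumSuffix(".*ene"));
    }
}
```

The remainder of the term would still be handed to java.util.regex unchanged;
the prefix only narrows where term enumeration starts.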

I'd like the surround parser to be a power tool that provides
everything that Lucene has under the hood. Regexes fit well
because they are already used for truncation; as far as I can see
now, only the truncation syntax will have a dot added.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

