lucene-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Test code for regex queries
Date Fri, 25 Nov 2005 10:14:02 GMT

On 24 Nov 2005, at 20:26, Erik Hatcher wrote:
>> There are some older regex implementations in java, but I
>> have no idea about the licences and the availability.
>> Doesn't apache have one somewhere?
>
> Two actually!  ORO and Regexp.  Here's ORO -
> <http://jakarta.apache.org/oro/> (link to Regexp from there)
>
> I'll dig into those soon and see what useful goodies lurk within.

From perusing the API via Javadocs, Regexp appears to offer just what
we need, but I didn't see the same sort of thing in ORO.  So I pulled
down Jakarta Regexp and dropped it in.  I had to add a getter for the
package-protected internal "prefix" field of REProgram, but once I did
that, here are some passing tests...

     assertEquals(1, getPrefix("a[bc]*"));
     assertEquals(2, getPrefix("a\\$[bc]*"));
     assertEquals(0, getPrefix("r?over"));


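   // Returns the length of the literal prefix Jakarta Regexp computes
   // for the expression, or 0 if there is none.
   private int getPrefix(String expression) {
     REProgram program = new RECompiler().compile(expression);
     // getPrefix() is the getter I added locally to REProgram
     char[] prefix = program.getPrefix();
     return prefix == null ? 0 : prefix.length;
   }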
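
For readers without a patched Jakarta Regexp at hand, the idea behind
those three assertions can be sketched in plain Java.  This is not
Regexp's own algorithm; LiteralPrefixSketch and literalPrefix are
hypothetical names, and the rules are deliberately conservative (any
alternation, anchor, class, or group ends the prefix, and a character
followed by "*", "?" or "{" is not counted).

```java
final class LiteralPrefixSketch {

    /** Returns literal text every match must start with, or "" if none can be proven. */
    static String literalPrefix(String regex) {
        // Any unescaped '|' makes a single shared prefix unsafe, so give up early.
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') { i++; continue; } // skip the escaped character
            if (c == '|') return "";
        }
        StringBuilder prefix = new StringBuilder();
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            char literal;
            int next;
            if (c == '\\') {
                if (i + 1 >= regex.length()) break;            // dangling backslash
                char escaped = regex.charAt(i + 1);
                if (Character.isLetterOrDigit(escaped)) break; // \d, \w, \1 are not literals
                literal = escaped;
                next = i + 2;
            } else if ("[](){}.*+?^$|".indexOf(c) >= 0) {
                break;                                         // metacharacter ends the prefix
            } else {
                literal = c;
                next = i + 1;
            }
            // A literal that may occur zero times cannot be part of the prefix.
            if (next < regex.length() && "*?{".indexOf(regex.charAt(next)) >= 0) break;
            prefix.append(literal);
            // "x+" guarantees one x, but nothing after it is fixed.
            if (next < regex.length() && regex.charAt(next) == '+') break;
            i = next;
        }
        return prefix.toString();
    }

    public static void main(String[] args) {
        System.out.println(literalPrefix("a[bc]*").length());
        System.out.println(literalPrefix("a\\$[bc]*").length());
        System.out.println(literalPrefix("r?over").length());
    }
}
```

The sketch errs on the side of returning a shorter prefix, since a
prefix that is too long would silently skip matching terms, while one
that is too short only costs extra enumeration.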

Quite promising!  The REProgram has the full parse tree as  
"instructions", so it'd be possible to use this for clever rotation  
also, I believe.  I'm sure Regexp doesn't support the full Perl5  
syntax that Java's regex package does, but it seems to be good enough  
for the basic regex syntax.

A couple of issues... 1) to use this additional library, (Span)RegexQuery
should be pulled into contrib/regex, 2) it'd be a little awkward to use
Jakarta Regexp to determine the prefix (and potentially for rotation
logic) and then use JDK regex for the actual matching.  I have no data
on which matches faster, or on any other pros/cons, just that the two
engines could potentially disagree on what a given expression matches.
I'm inclined to swap completely to Jakarta Regexp for matching as well,
at least for the time being, in order to keep things in sync and
benefit from more clever term enumeration.  The time saved in term
enumeration seems likely to more than make up for matching speed
differences.
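
To make the term-enumeration saving concrete, here is a minimal sketch
that assumes the term dictionary behaves like a sorted set.  It does
not use Lucene's API: PrefixEnumerationSketch and matchingTerms are
hypothetical names, a TreeSet stands in for the index, and
java.util.regex does the matching only to keep the example
self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

final class PrefixEnumerationSketch {

    /** Visits only the terms that start with prefix, then applies the full regex to each. */
    static List<String> matchingTerms(NavigableSet<String> sortedTerms,
                                      String prefix, Pattern pattern) {
        List<String> hits = new ArrayList<>();
        // tailSet seeks to the first term >= prefix, much like seeking a term enumerator.
        for (String term : sortedTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(
                Arrays.asList("abb", "abc", "ac", "b", "over", "rover"));
        // Prefix "a" stops the scan at "b"; an empty prefix scans every term.
        System.out.println(matchingTerms(terms, "a", Pattern.compile("a[bc]*")));
        System.out.println(matchingTerms(terms, "", Pattern.compile("r?over")));
    }
}
```

With a one-character prefix the loop touches four of the six terms;
with an empty prefix it touches all of them, which is the cost a
usable prefix avoids.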

Thoughts?

	Erik



