jakarta-oro-dev mailing list archives

From "Steve Kearns" <skea...@hotmail.com>
Subject New Regular Expression Package: JavaRegex2
Date Thu, 01 Jan 1970 00:00:00 GMT

The purpose of this letter is to suggest a package that can serve
as the next generation regexp matcher.  JavaRegex2, as it is
currently called, exists and has been used in production.

I am the author of this library.  I am an expert in regular
expressions, since I did my thesis on "Extending regular
expressions".  I can most likely get permission to open source
it, but I only want to go through all of the necessary paperwork
if it is likely to be adopted.  Please provide appropriate feedback.

JavaRegex2 offers significant improvements over ORO, which I
describe below.

-------- Full Unicode Support:

Character classes can contain the full range of Unicode characters.
Input patterns can include arbitrary Unicode characters.
Patterns can be matched against input in any character set.

-------- Patterns in external files:

JavaRegex2 supports the option of storing patterns in separate
text files.  The benefit of this is that regular expressions
can be updated without recompiling an application.

-------- Compiled patterns:

Certain ORO patterns may take very long to execute on even
small strings, because ORO uses recursive backtracking.

JavaRegex2 solves this problem by offering a simple switch
which compiles a pattern into a DFA, so that matching takes time
proportional to the length of the input, no matter how complicated
the pattern is or which pattern types are used.  Only one parameter
has to be changed when switching between compiled and interpreted
versions.  In fact, JavaRegex2 currently uses ORO as the
regular expression engine when not in compiled mode.
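
To illustrate why a DFA avoids the backtracking problem, here is a
minimal hand-written sketch of a table-driven matcher for the pattern
[0-9]+(\.[0-9]*)?.  It is not JavaRegex2's compiler output or API; the
class name FloatDfa, the state numbering and the table layout are all
assumptions made for illustration.  Each input character is examined
exactly once, so there is nothing to backtrack over.

```java
final class FloatDfa {
    // Character categories used as column indexes of the transition table.
    private static final int DIGIT = 0, DOT = 1, OTHER = 2;
    private static final int DEAD = -1;

    // Hypothetical DFA for [0-9]+(\.[0-9]*)?
    // State 0: start, state 1: integer part, state 2: after the dot.
    private static final int[][] NEXT = {
        /* 0 */ {1, DEAD, DEAD},
        /* 1 */ {1, 2, DEAD},
        /* 2 */ {2, DEAD, DEAD},
    };
    private static final boolean[] ACCEPTING = {false, true, true};

    /** Returns true if the whole input is accepted; one table lookup per character. */
    static boolean matches(CharSequence input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int category = (c >= '0' && c <= '9') ? DIGIT : (c == '.') ? DOT : OTHER;
            state = NEXT[state][category];
            if (state == DEAD) {
                return false; // no transition: reject without revisiting earlier input
            }
        }
        return ACCEPTING[state];
    }

    public static void main(String[] args) {
        System.out.println(matches("3.14"));  // true
        System.out.println(matches("1.2.3")); // false
    }
}
```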

-------- Precompiled patterns:

When compiled patterns are allowed, the time to compile a
pattern may be several seconds or minutes on large patterns.
For this reason, it is important to allow patterns to be
precompiled so that they can be loaded and matched efficiently.

JavaRegex2 caches the result of compiling a regular expression
in an external file and automatically uses the cached version
if it is available, skipping the potentially time-consuming
compilation process.
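
The caching idea can be sketched as follows.  This is only an
illustration under stated assumptions, not JavaRegex2's implementation:
the class name CompiledPatternCache, the ".cache" file suffix, the use
of an int[][] transition table as the compiled form, and the check that
the cached pattern text still equals the current text are all
hypothetical choices.  The slow compiler is passed in as a function.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

final class CompiledPatternCache {
    /**
     * Returns the compiled table for patternFile.  If a cache file exists next
     * to it and was built from the same pattern text, the table is loaded from
     * the cache and the (possibly slow) compiler is not called.
     */
    static int[][] loadOrCompile(Path patternFile, Function<String, int[][]> compiler)
            throws IOException {
        String source = new String(Files.readAllBytes(patternFile), StandardCharsets.UTF_8);
        Path cacheFile = patternFile.resolveSibling(patternFile.getFileName() + ".cache");

        if (Files.exists(cacheFile)) {
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(cacheFile))) {
                String cachedSource = (String) in.readObject();
                if (cachedSource.equals(source)) {
                    return (int[][]) in.readObject(); // cache hit: skip compilation
                }
            } catch (ClassNotFoundException | ClassCastException e) {
                // Unreadable cache contents: fall through and recompile.
            }
        }

        int[][] table = compiler.apply(source);
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(cacheFile))) {
            out.writeObject(source); // stored so that a stale cache can be detected
            out.writeObject(table);
        }
        return table;
    }
}
```

Storing the pattern text alongside the table is one simple way to avoid
using a stale cache after the pattern file is edited; comparing file
timestamps would be another.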

-------- Reusable pattern definitions:

When building large patterns, or when reusing pattern fragments,
it is important to be able to define named patterns.  JavaRegex2
allows a pattern to be defined as a sequence of named patterns, with
the one named "main" being the actual final pattern.  Here is an example:

digit=[0-9]
number=[!digit]+
floatingNumber=[!number](\.[!number]?)?
anystr=.*
main=[!anystr][!floatingNumber]
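
One way to think about these definitions is as textual substitution.
The sketch below is an assumption about how such definitions could be
expanded, not JavaRegex2's code: it replaces each [!name] reference
with the referenced definition wrapped in a non-capturing group and
hands the result to java.util.regex.  The class and method names are
hypothetical, and the sketch does not try to tell a [!name] reference
apart from a character class that happens to start with "!".

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedPatternExpander {
    // A reference to a named pattern, written [!name] in the definitions.
    private static final Pattern REF = Pattern.compile("\\[!(\\w+)\\]");

    /** Parses "name=pattern" lines into a map, skipping blank lines. */
    static Map<String, String> parse(String source) {
        Map<String, String> defs = new LinkedHashMap<>();
        for (String line : source.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) continue;
            int eq = line.indexOf('=');
            if (eq < 0) throw new IllegalArgumentException("not a definition: " + line);
            defs.put(line.substring(0, eq), line.substring(eq + 1));
        }
        return defs;
    }

    /** Expands the definition called "main" into one plain regular expression. */
    static String expandMain(Map<String, String> defs) {
        return expand("main", defs, new HashSet<>());
    }

    private static String expand(String name, Map<String, String> defs, Set<String> active) {
        String body = defs.get(name);
        if (body == null) throw new IllegalArgumentException("undefined pattern: " + name);
        if (!active.add(name)) throw new IllegalArgumentException("recursive pattern: " + name);
        StringBuilder out = new StringBuilder();
        Matcher m = REF.matcher(body);
        int last = 0;
        while (m.find()) {
            out.append(body, last, m.start());
            // Wrap the expansion so that a following quantifier applies to all of it.
            out.append("(?:").append(expand(m.group(1), defs, active)).append(')');
            last = m.end();
        }
        out.append(body.substring(last));
        active.remove(name);
        return out.toString();
    }

    public static void main(String[] args) {
        String source = String.join("\n",
                "digit=[0-9]",
                "number=[!digit]+",
                "floatingNumber=[!number](\\.[!number]?)?",
                "anystr=.*",
                "main=[!anystr][!floatingNumber]");
        String regex = expandMain(parse(source));
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "price 3.14")); // true
    }
}
```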

Furthermore, JavaRegex2 supports an "include" file feature
so that common patterns can be collected in pattern libraries
and included as needed.  For example:

File PatternLibrary1.pat:

digit=[0-9]
number=[!digit]+
floatingNumber=[!number](\.[!number]?)?
anystr=.*

File MyPattern.pat:

include "PatternLibrary1.pat"
main=[!anystr][!floatingNumber]
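
Include handling of this kind can be sketched as a recursive file
reader.  The following is a guess at the behaviour, not JavaRegex2's
implementation: the class name IncludeResolver is hypothetical, and the
choices to resolve included files relative to the including file and to
reject circular includes are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IncludeResolver {
    // A line of the form: include "SomeFile.pat"
    private static final Pattern INCLUDE = Pattern.compile("include\\s+\"([^\"]+)\"");

    /** Returns the definition lines of file with every include replaced by its contents. */
    static List<String> load(Path file) throws IOException {
        return load(file.toAbsolutePath().normalize(), new HashSet<>());
    }

    private static List<String> load(Path file, Set<Path> active) throws IOException {
        if (!active.add(file)) {
            throw new IllegalStateException("circular include: " + file);
        }
        List<String> lines = new ArrayList<>();
        for (String raw : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String line = raw.trim();
            Matcher m = INCLUDE.matcher(line);
            if (m.matches()) {
                // Included files are looked up next to the including file.
                lines.addAll(load(file.resolveSibling(m.group(1)).normalize(), active));
            } else if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        active.remove(file);
        return lines;
    }
}
```

Loading MyPattern.pat from the example above with this sketch would
yield the four library definitions followed by the "main" definition.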

-------- Labeled Subpatterns for Matching:

ORO and Perl provide information about how a pattern matched by
requiring that the programmer identify the index of the parentheses
surrounding the subpattern of interest.  This is error prone,
and an alteration to the regular expression might cause these
numbers to change.  And when a regex is large or uses named
definitions, it is especially difficult to identify the index
of a particular subpattern.

JavaRegex2 allows a subpattern to be labeled with a unique
string, and then the value that matches it can be retrieved
by that label.  For example:

digit=[0-9]
number=[!digit]+
floatingNumber=[!number=>n1](\.[!number=>n2]?)?
anystr=.*
main=[!anystr][!floatingNumber=>f1]

Notice that several of the references have "=>label" appended
(here "=>n1", "=>n2" and "=>f1").  This labels the associated
subpattern.  Information about what matched can then be retrieved
by requesting "f1.n1" or "f1.n2" instead of an obscure index.
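
To make the idea concrete, here is a rough sketch of how label paths
could be mapped onto ordinary group indexes.  It is not JavaRegex2's
implementation or interface: the class LabeledPattern and its match
method are hypothetical, java.util.regex stands in for the real engine,
and the group counting only handles simple escapes and character
classes.  A labeled reference becomes a capturing group whose index is
recorded under its dotted path.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LabeledPattern {
    // Either [!name] or [!name=>label].
    private static final Pattern REF = Pattern.compile("\\[!(\\w+)(?:=>(\\w+))?\\]");

    private final Map<String, String> defs;
    private final Map<String, Integer> groupOf = new LinkedHashMap<>();
    private final StringBuilder regex = new StringBuilder();
    private final Pattern compiled;
    private int groups = 0;

    LabeledPattern(Map<String, String> defs) {
        this.defs = defs;
        expand("main", "");
        this.compiled = Pattern.compile(regex.toString());
    }

    /** Returns label path -> matched text, or null if the input does not match. */
    Map<String, String> match(String input) {
        Matcher m = compiled.matcher(input);
        if (!m.matches()) return null;
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : groupOf.entrySet()) {
            result.put(e.getKey(), m.group(e.getValue()));
        }
        return result;
    }

    private void expand(String name, String path) {
        String body = defs.get(name);
        if (body == null) throw new IllegalArgumentException("undefined pattern: " + name);
        Matcher m = REF.matcher(body);
        int last = 0;
        while (m.find()) {
            literal(body.substring(last, m.start()));
            String label = m.group(2);
            if (label == null) {
                regex.append("(?:");
                expand(m.group(1), path);
            } else {
                String full = path.isEmpty() ? label : path + "." + label;
                regex.append('(');
                groupOf.put(full, ++groups); // remember which group holds this label
                expand(m.group(1), full);
            }
            regex.append(')');
            last = m.end();
        }
        literal(body.substring(last));
    }

    /** Appends literal regex text, counting the capturing groups it opens. */
    private void literal(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '[') {
                int close = text.indexOf(']', i + 1);
                if (close > i) i = close; // skip a simple character class
            } else if (c == '(' && !text.startsWith("?", i + 1)) {
                groups++; // an ordinary, unlabeled capturing group
            }
        }
        regex.append(text);
    }
}
```

With the definitions above but main=[!floatingNumber=>f1] (the leading
[!anystr] is dropped here because a greedy .* in java.util.regex would
consume most of the number), matching "3.14" yields "3.14" for f1,
"3" for f1.n1 and "14" for f1.n2.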

-------- Maximal backwards compatibility:

JavaRegex2 offers maximal compatibility with the Perl5 regular
expression language, given the few new syntax additions to
support all of the new features.

All Perl5 regular expression language features have been implemented,
except for the few features which are not
possible to implement with a compiled regular expression:

* Backreferences (\1, \2) cannot be implemented, since they
  describe non-regular languages that no DFA can recognize.
* The (?>pat) construct is not implemented.

Also, the new features do not fit into the existing ORO
interfaces, so unfortunately users would have to
learn to use a slightly different interface.


