jakarta-oro-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nathan Fairchild <fairch...@counterpane.com>
Subject Jakarta-ORO experiments in PERL5 and REGEXs.
Date Tue, 18 Nov 2003 03:45:39 GMT
Greetings.

I've been experimenting with Perl5 REGEXs which cause previously
reported issues with matching, namely, infinite-loops, and
java.lang.stackOverflowException(s).  I've found some fairly short
regular expressions and input which seem to demonstrate the danger of
particularly odd or poorly written REGEXs.

I don't have immediate access to PERL 5.0003, but I am using 5.005,
5.008, and 5.8.0.  None of these versions of PERL produce exactly the
same results as the Jakarta-ORO matcher.  Each REGEXP compiles, and
matches or does match in 6-30ms consistently (minus the odd match time
in the thousands of ms.)

I've tested this test program with jakarta-oro 2.0 -> 2.0.7, JRE 1.3.1
-> 1.4.1, and linux 6.2, 7.0, and 9.0, and the same results apply.

I realize that these are NOT well written regular expressions, but this
whole experiment came about by having some hand me back my Java classes,
saying that my code hangs.  I debugged down to the Perl5Matcher.match()
call in my code, and then wrote my test program.  When stuff started
blowing up, I started looking through the oro-dev archives, and I
thought I would contribute to the "DON'T DO THIS" list.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

ORO will generate the stackOverflow exception using the following:

INPUT STRING: "X0"

REGEX 1:	"X(\d*)*X"
REGEX 2:	"X(\d*)+X"
REGEX 3:	"X(\d*\s*)*X"
REGEX 4:	"X(\d*\s*)+X"
REGEX 5:	"X([\d-]*\s*)*X"
REGEX 6:	"X([\d-]*\s*)+X"

Minor mods, like "\s*" to "\s+" and "\d*" to "\d+" resolve this
problem.  Also, changing the input string to "X0X" causes a match, but
changing it to "X0-" still causes some of the REGEXs to fail, while
others work just fine.  See e-mail:
	From: Pavel Sher
	Subject: StackOverflowException 
	Date: Fri, 19 Sep 2003 10:00:43 -0700 

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Here is an example of the output from the test program.  What you
might notice is that once the input string has reached a certain
size, every new character appended to the input string doubles the
match time.  See [Bug 6838], [Bug 14850], and [Bug 23132]:

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

$ ./RegexLoop1.sh 

RegexLoop Version 1 (Linux 2.2.16-22smp [i386]) (Sun Microsystems Inc.
JRE 1.4.0_01-b03) (Jakarta ORO 2.0.7)

Compiling 'X([\d-]+)+X' ... DONE!

Matching 'X' ... DONE! false (2ms)
Matching 'X-' ... DONE! false (0ms)
Matching 'X-1' ... DONE! false (0ms)
Matching 'X-1-' ... DONE! false (0ms)
Matching 'X-1-1' ... DONE! false (0ms)
Matching 'X-1-12' ... DONE! false (1ms)
Matching 'X-1-12-' ... DONE! false (1ms)
Matching 'X-1-12-1' ... DONE! false (79ms)
Matching 'X-1-12-12' ... DONE! false (4ms)
Matching 'X-1-12-123' ... DONE! false (25ms)
Matching 'X-1-12-123-' ... DONE! false (57ms)
Matching 'X-1-12-123-1' ... DONE! false (9ms)
Matching 'X-1-12-123-12' ... DONE! false (7ms)
Matching 'X-1-12-123-123' ... DONE! false (17ms)
Matching 'X-1-12-123-1234' ... DONE! false (30ms)
Matching 'X-1-12-123-1234-' ... DONE! false (73ms)
Matching 'X-1-12-123-1234-1' ... DONE! false (124ms)
Matching 'X-1-12-123-1234-12' ... DONE! false (236ms)
Matching 'X-1-12-123-1234-123' ... DONE! false (475ms)
Matching 'X-1-12-123-1234-1234' ... DONE! false (946ms)
Matching 'X-1-12-123-1234-12345' ... DONE! false (1905ms)
Matching 'X-1-12-123-1234-12345-' ... DONE! false (3861ms)
Matching 'X-1-12-123-1234-12345-1' ... DONE! false (8130ms)
Matching 'X-1-12-123-1234-12345-12' ... DONE! false (16789ms)
Matching 'X-1-12-123-1234-12345-123' ... DONE! false (32375ms)
Matching 'X-1-12-123-1234-12345-1234' ... DONE! false (64900ms)
Matching 'X-1-12-123-1234-12345-12345' ... DONE! false (132538ms)
Matching 'X-1-12-123-1234-12345-123456' ... DONE! false (263073ms)
Matching 'X-1-12-123-1234-12345-123456-' ... DONE! false (529350ms)
Matching 'X-1-12-123-1234-12345-123456-1' ... ^C
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

$ ./RegexLoop1.sh -bad

RegexLoop Version 1 (Linux 2.2.16-22smp [i386]) (Sun Microsystems Inc.
JRE 1.4.0_01-b03) (Jakarta ORO 2.0.7)

Compiling 'X([\d-]*)*X' ... DONE!

Matching 'X' ... DONE! false (4ms)
Matching 'X-' ... Exception in thread "main" java.lang.StackOverflowError
        at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1296)
        at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1193)
        at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1309)
        at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1193)
        at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1309)
        at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1193)
        at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1309)
etc..
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Here is my test program:

/*
----------------------------------------------------------------------------
-
 * FILE: RegexLoop1.java
 * AUTH: Nathan E. Fairchild
 * DATE: Nov-14-2003
 *
 * Example program to illustrate the performance decay with a simple/bad
REGEXP
 * and a special set of strings.  Run with -bad, or -bad2 and watch it blow
up.
 *
 * Testing shell script:
 *
 * ---- RegexLoop1.sh
----------------------------------------------------------
 * #!/bin/bash
 * 
 * CLASSPATH=${JAKARTA_HOME}/jakarta-oro-2.0.X.jar;${JAVA_HOME}:./
 *
 * case $1 in
 *  -compile)   ${JAVA_HOME}/bin/javac -classpath ${CLASSPATH}
RegexLoop1.java
 *              ;;
 *         *)   ${JAVA_HOME}/bin/java -cp ${CLASSPATH} RegexLoop1 $@
 *              ;;
 * esac
 * exit $?
 * ---- RegexLoop1.sh
----------------------------------------------------------
 *
 *
--------------------------------------------------------------------------
*/
import java.lang.String;
import java.util.Iterator;
import java.util.StringTokenizer;

// Used to pull info from Manifest in jar files
import java.util.jar.JarFile;
import java.util.jar.Attributes;

// The matching classes for this test
import org.apache.oro.text.regex.Perl5Pattern;
import org.apache.oro.text.regex.Perl5Matcher;
import org.apache.oro.text.regex.Perl5Compiler;

public class RegexLoop1 {

        // We need a matcher to match the message against the pattern
        private static final Perl5Matcher matcher = new Perl5Matcher();

        // We need a compiler to generate the pattern from the REGEXP
        private static final Perl5Compiler compiler = new Perl5Compiler();

        // We need a pattern to hold the compiled REGEXP
        private Perl5Pattern pattern = null;

        // We need the problem REGEXPs
        private static final String _regexps[] = {
                "X([\\d-]+)+X",
                "X([\\d-]*)*X",         // java.lang.StackOverflowError
                "X([\\d-]*)+X",         // java.lang.StackOverflowError
        };

        // We need a message to excercise the bug (either one)
        //private static final String _msg =
"X-1-11-111-1111-11111-111111-1111111-11111111";
        private static final String _msg =
"X-1-12-123-1234-12345-123456-1234567-12345678";

        /**
         * Initializes the built-in REGEXP.
         *
         * @throws Exception because I don't really care about getting
specific.
         */
        public RegexLoop1( int reNum ) throws Exception {
                String re = _regexps[reNum];
                System.err.print( "\nCompiling '" + re + "' ... " );
                pattern = (Perl5Pattern)compiler.compile( re,
Perl5Compiler.SINGLELINE_MASK );
                System.err.println( "DONE!\n" );
        }

        /**
         * Return the target message string.
         */
        public String getMsg() { return _msg; }

        /**
         * Test a match of the given msg against the built-in REGEXP.
         */
        public boolean match( String msg ) {
                System.err.print( "Matching '" + msg + "' ... " );
                long startTime = System.currentTimeMillis();
                boolean res = matcher.contains( msg, pattern );
                long endTime = System.currentTimeMillis();
                System.err.println( "DONE! " + res + " (" + (endTime -
startTime) + "ms)");
                return res;
        }

        /**
         * Self-document.
         */
        public static void identify() {
                //  Display some basic runtime information
                System.err.print( "\nRegexLoop Version 1" );
                System.err.print( " (" + System.getProperty("os.name") );
                System.err.print( " " + System.getProperty("os.version") );
                System.err.print( " [" + System.getProperty("os.arch") +
"])" );
                System.err.print( " (" + System.getProperty("java.vendor")
);
                System.err.print( " JRE " +
System.getProperty("java.runtime.version") + ")" );

                // Grab the path of the ORO matcher jar file
                StringTokenizer paths = new StringTokenizer(
                                System.getProperty("java.class.path"),
                                System.getProperty("path.separator") );
                String jarPath = null;
                while( paths.hasMoreTokens() ) {
                        String p = paths.nextToken();
                        if( p.indexOf("jakarta-oro-") > -1) {
                                jarPath = p;
                                break;
                        }
                }

                // Query the manifest for title and version
                if( jarPath != null ) {
                        try {
                                JarFile jar = new JarFile( jarPath );
                                Attributes attrs =
jar.getManifest().getAttributes("org/apache/oro");
                                String specTitle =
attrs.getValue(Attributes.Name.SPECIFICATION_TITLE);
                                String specVersion =
attrs.getValue(Attributes.Name.SPECIFICATION_VERSION);
                                System.err.print( " (" + specTitle + " " +
specVersion + ")" );
                        } catch( Exception e ) {
                                StringTokenizer parts = new StringTokenizer(
jarPath,
 
System.getProperty("file.separator") );
                                String jar = null;
                                while( parts.hasMoreTokens() ) {
                                        jar = parts.nextToken();
                                }
                                System.err.print( " (" + jar + ") " );
                        }
                }
                System.err.println( "" );
        }

        /* MAIN
------------------------------------------------------------- */
        public static void main( String args[] ) {

                // Identify the working environment
                identify();

                // Check for the -bad command-line argument
                int useRe = 0;
                if( args.length > 0 ) {
                        if( "-bad".equals(args[0]) )
                                useRe = 1;
                        else if( "-bad2".equals(args[0]) )
                                useRe = 2;
                }

                // Create a new looper
                RegexLoop1 looper = null;
                try {
                        looper = new RegexLoop1( useRe );
                } catch( Exception e ) {
                        System.err.println( e );
                        System.exit( 1 );
                }

                // Match strings of increasing size
                int len = looper.getMsg().length();
                for( int i = 1; i <= len; i++ ) {
                        String msg = looper.getMsg().substring(0, i);
                        looper.match( msg );
                }
        }
        /* MAIN
------------------------------------------------------------- */
}

Here is some Perl output of the same REGEX matching:

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

$ fail2.pl 
Fail Version 2 (linux 2.2.16-22smp [i686]) (Perl 5.006)

Compiling 'X([\d-]+)+X' ... DONE!

Matching 'X' ... DONE! false (0.000031)
Matching 'X-' ... DONE! false (0.000006)
Matching 'X-1' ... DONE! false (0.000015)
Matching 'X-1-' ... DONE! false (0.000007)
Matching 'X-1-1' ... DONE! false (0.000006)
Matching 'X-1-12' ... DONE! false (0.000006)
Matching 'X-1-12-' ... DONE! false (0.000023)
Matching 'X-1-12-1' ... DONE! false (0.000008)
Matching 'X-1-12-12' ... DONE! false (0.000007)
Matching 'X-1-12-123' ... DONE! false (0.000006)
Matching 'X-1-12-123-' ... DONE! false (0.000006)
Matching 'X-1-12-123-1' ... DONE! false (0.000007)
Matching 'X-1-12-123-12' ... DONE! false (0.000007)
Matching 'X-1-12-123-123' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234' ... DONE! false (0.001725)
Matching 'X-1-12-123-1234-' ... DONE! false (0.000010)
Matching 'X-1-12-123-1234-1' ... DONE! false (0.000007)
Matching 'X-1-12-123-1234-12' ... DONE! false (0.000007)
Matching 'X-1-12-123-1234-123' ... DONE! false (0.000007)
Matching 'X-1-12-123-1234-1234' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-1' ... DONE! false (0.000012)
Matching 'X-1-12-123-1234-12345-12' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-123' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-1234' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-12345' ... DONE! false (0.000007)
Matching 'X-1-12-123-1234-12345-123456' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-123456-' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-123456-1' ... DONE! false (0.003880)
Matching 'X-1-12-123-1234-12345-123456-12' ... DONE! false (0.000008)
Matching 'X-1-12-123-1234-12345-123456-123' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-123456-1234' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-123456-12345' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-123456-123456' ... DONE! false (0.000007)
Matching 'X-1-12-123-1234-12345-123456-1234567' ... DONE! false (0.001842)
Matching 'X-1-12-123-1234-12345-123456-1234567-' ... DONE! false (0.000015)
Matching 'X-1-12-123-1234-12345-123456-1234567-1' ... DONE! false (0.000007)
Matching 'X-1-12-123-1234-12345-123456-1234567-12' ... DONE! false
(0.000006)
Matching 'X-1-12-123-1234-12345-123456-1234567-123' ... DONE! false
(0.000007)
Matching 'X-1-12-123-1234-12345-123456-1234567-1234' ... DONE! false
(0.000024)
Matching 'X-1-12-123-1234-12345-123456-1234567-12345' ... DONE! false
(0.000007)
Matching 'X-1-12-123-1234-12345-123456-1234567-123456' ... DONE! false
(0.000007)
Matching 'X-1-12-123-1234-12345-123456-1234567-1234567' ... DONE! false
(0.000021)
Matching 'X-1-12-123-1234-12345-123456-1234567-12345678' ... DONE! false
(0.000008)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


-Nathan

---------------------------------------------------------------------
To unsubscribe, e-mail: oro-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: oro-dev-help@jakarta.apache.org


Mime
View raw message