xerces-j-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Glavassevich <mrgla...@ca.ibm.com>
Subject Re: More on (Re: (XERCESJ-589) Bug with pattern restriction on long strings)
Date Mon, 09 Jul 2007 03:55:29 GMT
Hi Geoff,

"Geoff M. Granum" <geoff.granum@lookingglasssoftware.com> wrote on 
07/05/2007 10:17:42 PM:

> Hello Michael,
> Everything works so far, with the exception of UNION, which technically 
> works fine but exposes an infinite loop bug which was being broken by a 
> stack overflow before.
> The regex: (((((boy)|(girl))[0-1][x-z]{2})?)|(man|woman)[0-1]?[y|n])*
> With Target: boy0xxwoman1ygirl1xyman
> Loops forever (the ending is invalid, needs [0-1]?[y|n] ). Well, forever 
> being 'until you run out of memory'. I didn't notice this until I set my 
> memory back to a reasonable size, as I had the -Xmx flag set to a GB 
> earlier for some playing.
> So the options are
> a. ) Fix it. Being a serious edge case, I'm thinking this isn't really 
> worth the risk. More bluntly, it would take me far more time than I want 
> to spend on it to become comfortable with any fix I might produce.
> b. ) Ignore the case (let the system die of OOM *eventually*, 
> your CPU will be pegged for a while before this happens)
> c. ) Throw an exception if the stack depth gets to some crazy level (a 
> million deep, for instance), perhaps adding a System variable to set the 
> allowable depth.
> d. ) Other?

I would go with option d: open a new JIRA issue [1] to make folks aware 
that this is a bug. Perhaps someone will fix it one day.

> For reference, using the standard JVM (no flags, 64MB heap space, IIRC), 
> can check the DNA string appended to itself ~425 times before reaching a 
> depth of a million stack elements. DNA_STRING is 2273 characters long.
> I run out of memory (heap space) around 650*DNA_STRING, with a depth of 
> 1,477,450. (the string being 1,477,450 characters long... big shock).

Cool. That's leaps and bounds better than the current code. I think folks 
were getting the stack overflow with around 2000 to 3000 characters 
(without increasing the stack size from the default).

> DNA String runs out of memory very quickly, whereas the above regex is 
> pretty slow about it because it pushes and pops a huge number of stack 
> items, *eventually* achieving an overflow. The DNA string pushes items 
> until it gets an answer, then backs out in the same order.
> Also, and oddly, the 'CAPTURE' option doesn't get hit a single time in 
> 7000 or so regex tests provided by the W3C group. Testing it with the 
> suite, which takes a painfully long time.
> I'm moving into optimization mode now that the regex tests check out; 
> I hear back as to how the group wants the new bug handled, I'll 
> it and post the code for review.
> Oh, did you want the Test Suite code I had to implement? There's a lot 
> generics code in it, and it's incredibly hackish, but it's free to 
> wants it. It is by no means robust nor complete. IntelliJ has an 'Export 
> to Eclipse' setting for modules, if that interests you.

If your test suite can be back-ported to Java 1.3 perhaps it could be 
included with the other sanity/unit tests which run off the build.xml 
'test' target.

> Cheers,
> -- 
> Geoff M. Granum
> Portland, Oregon


[1] http://issues.apache.org/jira/browse/XERCESJ

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

To unsubscribe, e-mail: j-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-dev-help@xerces.apache.org

View raw message