commons-dev mailing list archives

From Benson Margulies <>
Subject Re: Anyone interested in regular expressions, again?
Date Sat, 31 Jan 2015 17:29:27 GMT
On Sat, Jan 31, 2015 at 12:22 PM, Bruno P. Kinoshita wrote:
> Hi Benson!
> I wouldn't be able to help at the moment, but some years ago I had a performance issue
> in a Nutch crawler with regexes [1] and found out about another library, which I think
> is the one you mentioned. Are you talking about ORO?

Yes, I believe I'm referring to ORO. I'm not really even looking for
help; I am looking to see whether there is enough interest to justify
exploring pushing the code into the ASF. We ran benchmarks, and it is
faster than the built-in Java regex engine in a variety of cases. We
are precluded from using GPL code, so we didn't look seriously at
OpenRegex. We want a system where outsiders can supply any regex they
like without our having to worry about one of our servers being eaten
by it.
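
To make the "server being eaten" concern concrete, here is a minimal sketch of a common workaround for the JRE engine: wrap the input in a CharSequence that checks a deadline on every character read, so a runaway match is aborted. This is not the ported Spencer library and not how it works; the names BoundedRegex, matchesWithin, and RegexTimeoutException are hypothetical, and whether a given pattern actually backtracks badly depends on the JRE version.

```java
import java.util.regex.Pattern;

/** Sketch only: bounds java.util.regex match time via a deadline-checking input. */
final class BoundedRegex {

    /** Hypothetical exception thrown when a match exceeds its time budget. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence view that checks the deadline on every character read. */
    private static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The matcher reads input through charAt, so this runs throughout matching.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex match exceeded time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Runs a full match, aborting with RegexTimeoutException once the budget is spent. */
    static boolean matchesWithin(Pattern pattern, CharSequence input, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
    }

    public static void main(String[] args) {
        StringBuilder input = new StringBuilder();
        for (int i = 0; i < 40; i++) {
            input.append('a');
        }
        // Nested quantifiers are a classic trigger for heavy backtracking.
        Pattern risky = Pattern.compile("(a+)+b");
        try {
            boolean matched = matchesWithin(risky, input, 100);
            System.out.println("finished within budget, matched=" + matched);
        } catch (RegexTimeoutException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```

A guard like this only limits the damage; it adds a clock read per character and still burns CPU up to the budget, which is why an engine with bounded matching time is the more attractive fix.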

> I ended up changing the regex and never had a chance to play with ORO or other libraries
> to see if there was any advantage over the JRE's regex API. Recently I had another
> performance problem, with an Apache Hive SerDe, and fixed it by changing the storage
> format and simplifying the regex.
> Have you done any performance comparison between your code and other libraries? More or
> less like this [2]? Maybe this library could be used as an alternative in Nutch, Commons Crawl,
> or other projects where performance is important.
> Lastly, I'm using OpenRegex (GPLv3) [3] in a project, in combination with Apache OpenNLP.
> It is a "regular expression language and engine" that users can use to match strings and NLP
> tags. For example:
> <string='My Company'> <lemma='be'> <postag='RB'>* (<adjective>:
> Where <lemma='be'> will match any form of be/is/was/were/etc., <postag='RB'>* will match
> zero or more adverbs, and the last part of the expression will find a named token "adjective"
> (JJ is the Penn Treebank part-of-speech tag for adjectives).
> Not sure if your library will work only with text or will support other approaches
> too. OpenRegex has some TODOs in the GitHub wiki but hasn't been updated in a while. Maybe
> if your library could work similarly to OpenRegex, it could be incorporated into Apache OpenNLP
> too. Even the LanguageTool team showed some interest in experimenting with it [4].
> Just food for thought :-)
> Bruno
> [1][2][3][4]
> From: Benson Margulies
> To: Commons Developers List
> Sent: Saturday, January 31, 2015 1:58 PM
> Subject: Anyone interested in regular expressions, again?
> So, once upon a time, there was a regex library here. It was retired,
> presumably on the grounds that it was rendered obsolete by the JRE's
> native support.
> However, the JRE's regular expressions have a pretty severe problem;
> they have unbounded (or at least, very, very bad) execution time for
> some combinations of data and regex.
> To cope with this, we ported the Henry Spencer regular expression
> library (as found in TCL) from C to Java.
> Thus:
> Is anyone interested in this? Give or take the possible IP muddle of
> the original C Code, I could grant it easily.
