commons-dev mailing list archives

From Benson Margulies <bimargul...@gmail.com>
Subject Re: Anyone interested in regular expressions, again?
Date Sat, 31 Jan 2015 17:29:27 GMT
On Sat, Jan 31, 2015 at 12:22 PM, Bruno P. Kinoshita
<brunodepaulak@yahoo.com.br> wrote:
> Hi Benson!
> I wouldn't be able to help at the moment, but some years ago I had a performance issue
with regexes in a Nutch crawler [1] and found out about another library, which I think is
the one you mentioned. Are you talking about ORO?

Yes, I believe I'm referring to ORO. I'm not really even looking for
help. I am looking to see if there is enough interest to justify
exploring pushing the code into the ASF. We ran benchmarks, and it is
faster than the JRE's built-in regex support in a variety of cases. We are precluded from
using GPL, so we didn't look seriously at OpenRegex. We want to have a
system where outsiders can supply any regex they like and we don't
have to worry about one of our servers being eaten by it.
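
A minimal sketch of one way to put a ceiling on a match while still using
java.util.regex, assuming it is acceptable to abort a match that reads too
many characters. This is not the tcl-regex-java API and not the approach
described above (which replaces the engine); BoundedRegex, CountingSequence,
BudgetExceededException and the budget parameter are hypothetical names made
up for this illustration. It only guards matching, not Pattern.compile.

```java
import java.util.regex.Pattern;

final class BoundedRegex {

    /** Thrown when the matcher has read more characters than the caller allowed. */
    public static final class BudgetExceededException extends RuntimeException {
        BudgetExceededException(long budget) {
            super("regex match exceeded budget of " + budget + " character reads");
        }
    }

    /** CharSequence wrapper that counts charAt calls and aborts past a limit. */
    private static final class CountingSequence implements CharSequence {
        private final CharSequence text;
        private final long budget;
        private final long[] used; // shared counter, so subsequences draw on the same budget

        CountingSequence(CharSequence text, long budget, long[] used) {
            this.text = text;
            this.budget = budget;
            this.used = used;
        }

        @Override
        public char charAt(int index) {
            // The matcher reads input through charAt, so heavy backtracking
            // shows up here as a very large number of calls.
            if (++used[0] > budget) {
                throw new BudgetExceededException(budget);
            }
            return text.charAt(index);
        }

        @Override
        public int length() {
            return text.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new CountingSequence(text.subSequence(start, end), budget, used);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Full-input match that gives up once more than budget characters have been read. */
    public static boolean matches(String regex, CharSequence input, long budget) {
        CharSequence guarded = new CountingSequence(input, budget, new long[1]);
        return Pattern.compile(regex).matcher(guarded).matches();
    }
}
```

With a generous budget, BoundedRegex.matches("a+b", "aaab", 1_000) behaves
like an ordinary match; a pattern that backtracks heavily on its input would
instead be cut off with BudgetExceededException once the budget is spent. A
read budget is only a mitigation: it caps the damage but does not make the
backtracking engine's worst case any better.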


> I ended up changing the regex and never had a chance to play with ORO or other libraries
to see if there was any advantage in not using the JRE's regex API. Recently I had another
performance problem, this time with Apache Hive SerDe, and fixed it by changing the storage
format and simplifying the regex.
> Have you done any performance comparison between your code and other libraries? More or
less like this [2]? Maybe this library could be used as an alternative in Nutch, Commons Crawl
or in other projects where performance is important.
> Lastly, I'm using OpenRegex (GPLv3) [3] in a project, in combination with Apache OpenNLP.
It is a "regular expression language and engine" that users can use to match string and NLP
tags. For example:
> <string='My Company'> <lemma='be'> <postag='RB'>* (<adjective>: <postag='JJ'>)
> Where <lemma='be'> will match any form of "be" (be/is/was/were/etc.), <postag='RB'>*
will match zero or more adverbs, and the last part of the expression will capture an
adjective as a named token "adjective" (JJ is the Penn Treebank part-of-speech tag for
adjectives).
> Not sure if your library will work only with text or will support other approaches
too. OpenRegex has some TODOs in the GitHub wiki but hasn't been updated in a while. Maybe
if your library could work similarly to OpenRegex, it could be incorporated in Apache OpenNLP
too. Even the LanguageTool team showed some interest in experimenting with it [4].
> Just food for thought :-)
> Bruno
> [1] https://issues.apache.org/jira/browse/NUTCH-1014
> [2] http://tusker.org/regex/regex_benchmark.html
> [3] https://github.com/knowitall/openregex
> [4] http://sourceforge.net/p/languagetool/mailman/languagetool-devel/thread/69f229c0a58d3245d511dafaa82feafc%40danielnaber.de/#msg31280519
>
>       From: Benson Margulies <bimargulies@gmail.com>
>  To: Commons Developers List <dev@commons.apache.org>
>  Sent: Saturday, January 31, 2015 1:58 PM
>  Subject: Anyone interested in regular expressions, again?
>
> So, once upon a time, there was a regex library here. It was retired,
> presumably on the grounds that it was rendered obsolete by the JRE's
> native support.
>
> However, the JRE's regular expressions have a pretty severe problem:
> they have unbounded (or at least very, very bad) execution time for
> some combinations of data and regex.
>
> To cope with this, we ported the Henry Spencer regular expression
> library (as found in TCL) from C to Java.
>
> Thus: https://github.com/basis-technology-corp/tcl-regex-java
>
> Is anyone interested in this? Give or take the possible IP muddle of
> the original C code, I could grant it easily.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org