harmony-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Deven You <devyo...@gmail.com>
Subject Re: A question about android regex implementation
Date Fri, 09 Jul 2010 08:57:42 GMT
2010/7/8 enh <enh@google.com>

> On Wed, Jul 7, 2010 at 09:55, Jesse Wilson <jessewilson@google.com> wrote:
> > On Mon, Jul 5, 2010 at 8:02 PM, Deven You <devyoudw@gmail.com> wrote:
> > > I have written some simple benchmarks for harmony regex and find the
> > > performance of harmony is poor compared to RI. For example,
> Mathcer.find()
> > > only reach 60% of that of RI.
> i'd be interested to see your benchmark code. our regular expression
> benchmarks actually focus on replacing use of Pattern/Matcher with
> simpler fast paths where the full complexity isn't needed. for
> example, most String.split calls don't actually use regular
> expressions. i'm not sure how much of this work made it into froyo,
> and it's certainly still a work in progress.
I have written a cpp simple becnmark use icu4c regex corresponding to the
java ones to test icu4c regex performance:

  int32_t            flags=0;
  UParseError        pe;
  UErrorCode         status=U_ZERO_ERROR;
  UnicodeString      re("(\\d{1,3})");
  UnicodeString      data = "aaaa123456789045";
  int i=0;

  RegexPattern *pat = RegexPattern::compile(re, flags, pe, status);

  RegexMatcher *matcher = pat->matcher(data, status);

  //warm up
  for (i=0; i<1000000; i++) {
    while (matcher->find()) {

  timeval time_val;
  long sec1, sec2, us1, us2;
  gettimeofday(&time_val, NULL);

  for (i=0; i<10000000; i++) {
    while (matcher->find()) {
  gettimeofday(&time_val, NULL);
  double interval = (sec2 -sec1)*1000.0 + (us2-us1)/1000.0;
  printf("total time of MatcherFind is %f ms \n", interval);

The test result is similar to harmony regex about 9883 to 10977 ms.
However when I use pattern ^([\S]+)\s+([\S]+).*\r?\n and matching string
"Test Matcher-find() performance estimation!\r\n". The test results show
icu4c has the best performance:
harmony:  14962 - 15760 ms
RI:             10999 - 11587 ms
ICU4C:     7193 - 7558 ms
It seems if the regex get complex, ICU4C will do the best performance.

> one thing we want to do is use http://code.google.com/p/caliper/
> benchmarks to track the relative performance of Java implementations
> of the stuff that's currently native code (but doesn't inherently need
> to be). we don't have anything like that for regular expressions yet.
> (our tool for conveniently running benchmarks is
> http://code.google.com/p/vogar/ and you can find some of our actual
> benchmarks at http://code.google.com/p/dalvik/.)
> > > I heard Android use icu4jni re-implement this
> > > module. Since icu4jni use native code I think it may has higher
> performance
> > > than harmony. I am trying to use icu4jni as the back-end of harmony
> regex
> > > but find icu4jni has no functions related to regex operations.
> although, as jesse said, regular expressions are a special case where
> there was no original icu4jni implementation, icu4jni is a dead
> project (http://site.icu-project.org/#TOC-ICU4JNI). Android used the
> icu4jni code, but it's been gradually rewritten since then. the
> process is incomplete, and the extent of the rewrite varies from class
> to class, but you're probably better off taking code from Android
> rather than icu4jni.
> > > I know there are some android guys in our community. So can anyone tell
> me
> > > some detail info for android's regex, like if it re-implement the regex
> > > logic using native code by android itself rather than icu4jni and
> really get
> > > higher performance compared to harmony regex? Thanks a lot!
> >
> > We actually use icu4c's pattern, which we access via JNI. Documentation
> is
> > here:
> >  http://icu-project.org/apiref/icu4c/classRegexPattern.html
> >
> > Our regex-java integration code is Apache-licensed and available in our
> Git
> > repository here:
> >
> >
> http://android.git.kernel.org/?p=platform/libcore.git;a=tree;f=regex/src/main/java/java/util/regex;h=f9207d043a24d3dc034bf2cc3a5fdd57115692c3;hb=master
> actually, in shipping versions of Android we use the C API rather than
> the C++ API. see
> http://android.git.kernel.org/?p=platform/libcore.git;a=blob;f=icu/src/main/native/NativeRegEx.cpp;h=7b3cafc0779ef7f3ed69a32128af7f96e3498922;hb=master
> for the gory details.
> i've recently rewritten this to use the C++ APIs, and it's much
> cleaner for it. if you look at the current code you'll see that we
> copy the input to the native heap, because icu4c < 4.6 can't cope with
> its input moving about between calls (and a VM can't necessarily
> guarantee the Java char[] doesn't move around).
> in future, we'll use a new icu4c API to get the best of both worlds:
> http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTim7czTaFNEBQm8tO3LhdDingvS3pshpW79XBIhu@mail.gmail.com&forum_name=icu-design
> if you're shipping your own copy of icu (rather than using some system
> one you have no control over), you might want to backport that change
> (which has been submitted).
> one final thing to be aware of is that icu4c's regular expression
> implementation isn't 100% compatible with the RI. most obviously,
> there's no support for CANON_EQ, but there are a variety of more
> subtle incompatibilities too.
> --
> Elliott Hughes - http://who/enh - http://jessies.org/~enh/

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message