harmony-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Deven You <devyo...@gmail.com>
Subject Fwd: A question about android regex implementation
Date Thu, 08 Jul 2010 03:19:56 GMT
---------- Forwarded message ----------
From: Deven You <devyoudw@gmail.com>
Date: 2010/7/8
Subject: Re: A question about android regex implementation
To: enh <enh@google.com>




2010/7/8 enh <enh@google.com>

On Wed, Jul 7, 2010 at 09:55, Jesse Wilson <jessewilson@google.com> wrote:
> > On Mon, Jul 5, 2010 at 8:02 PM, Deven You <devyoudw@gmail.com> wrote:
> > > I have written some simple benchmarks for harmony regex and find the
> > > performance of harmony is poor compared to RI. For example,
> Mathcer.find()
> > > only reach 60% of that of RI.
>
> i'd be interested to see your benchmark code. our regular expression
> benchmarks actually focus on replacing use of Pattern/Matcher with
> simpler fast paths where the full complexity isn't needed. for
> example, most String.split calls don't actually use regular
> expressions. i'm not sure how much of this work made it into froyo,
> and it's certainly still a work in progress.
>
I first found  harmony regex is slower than RI  is from an internal
benchmark test result. In this benchmark, Mather.find() is relatively hot
and slower than RI. Then I write a simple test to verify it:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherFind {


public static void main(String[] args) {
 Pattern pat = Pattern.compile("(\\d{1,3})");
Matcher matcher = pat.matcher("aaaa123456789045");
 //warm up
for (int i = 0; i <1000000; i++) {
 while(matcher.find()) {
}
 matcher.reset();
}
 long time = System.currentTimeMillis();
for (int i = 0; i <10000000; i++) {
 while(matcher.find()) {
}
 matcher.reset();
}
time = System.currentTimeMillis() - time;
               System.out.println("total time of MatcherFind is " + time + "
ms!");

}

The results are harmony takes 9841 -10079 ms meanwhile RI takes 5596 - 5877
ms.

In volano benchmark, Matcher(Pattern, CharSequence) and Mather.group(int)
are also slower than RI, although they are not hot methods. My simple
benchmarks for these 2 methods also approve this.
1. Matcher(Pattern, CharSequence)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherConstruct {
public static void main(String[] args) {
        Pattern pat = Pattern.compile("XX");
        Matcher matcher = null;

        //warmup phase
        for (int i = 0; i < 100000; i++) {
         matcher = pat.matcher("Today is XX-XX-XX ...");
        }

        long time = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
         matcher = pat.matcher("Today is XX-XX-XX ...");
        }
        time = System.currentTimeMillis() - time;
        System.out.println("Total time of MatcherConstruct is " + time + "
ms!");
}
}

2. Matcher.groun(int)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherGroupInt {
 public static void main(String[] args) {
        String testPattern = "(\\d{1,3})";
        String testString = "aaaa123456789045";

        Pattern pat = Pattern.compile(testPattern);
        Matcher matcher = pat.matcher(testString);

        //warm up
        for (int i = 0; i < 500000; i++) {
            while (matcher.find()) {
                String result = matcher.group(1);
               }
            matcher.reset();
        }

        long time = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
            while (matcher.find()) {
                String result = matcher.group(1);
               }
            matcher.reset();
        }
        time = System.currentTimeMillis() - time;
        System.out.println("total time of MatcherGroupInt is " + time + "
ms!");
}
}


> one thing we want to do is use http://code.google.com/p/caliper/
> benchmarks to track the relative performance of Java implementations
> of the stuff that's currently native code (but doesn't inherently need
> to be). we don't have anything like that for regular expressions yet.
> (our tool for conveniently running benchmarks is
> http://code.google.com/p/vogar/ and you can find some of our actual
> benchmarks at http://code.google.com/p/dalvik/.)
>
> > > I heard Android use icu4jni re-implement this
> > > module. Since icu4jni use native code I think it may has higher
> performance
> > > than harmony. I am trying to use icu4jni as the back-end of harmony
> regex
> > > but find icu4jni has no functions related to regex operations.
>
> although, as jesse said, regular expressions are a special case where
> there was no original icu4jni implementation, icu4jni is a dead
> project (http://site.icu-project.org/#TOC-ICU4JNI). Android used the
> icu4jni code, but it's been gradually rewritten since then. the
> process is incomplete, and the extent of the rewrite varies from class
> to class, but you're probably better off taking code from Android
> rather than icu4jni.
>
> > > I know there are some android guys in our community. So can anyone tell
> me
> > > some detail info for android's regex, like if it re-implement the regex
> > > logic using native code by android itself rather than icu4jni and
> really get
> > > higher performance compared to harmony regex? Thanks a lot!
> >
> > We actually use icu4c's pattern, which we access via JNI. Documentation
> is
> > here:
> >  http://icu-project.org/apiref/icu4c/classRegexPattern.html
> >
> > Our regex-java integration code is Apache-licensed and available in our
> Git
> > repository here:
> >
> >
> http://android.git.kernel.org/?p=platform/libcore.git;a=tree;f=regex/src/main/java/java/util/regex;h=f9207d043a24d3dc034bf2cc3a5fdd57115692c3;hb=master
>
> actually, in shipping versions of Android we use the C API rather than
> the C++ API. see
>
> http://android.git.kernel.org/?p=platform/libcore.git;a=blob;f=icu/src/main/native/NativeRegEx.cpp;h=7b3cafc0779ef7f3ed69a32128af7f96e3498922;hb=master
> for the gory details.
>
> i've recently rewritten this to use the C++ APIs, and it's much
> cleaner for it. if you look at the current code you'll see that we
> copy the input to the native heap, because icu4c < 4.6 can't cope with
> its input moving about between calls (and a VM can't necessarily
> guarantee the Java char[] doesn't move around).
>
> in future, we'll use a new icu4c API to get the best of both worlds:
>
>
> http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTim7czTaFNEBQm8tO3LhdDingvS3pshpW79XBIhu@mail.gmail.com&forum_name=icu-design
>
> if you're shipping your own copy of icu (rather than using some system
> one you have no control over), you might want to backport that change
> (which has been submitted).
>
> one final thing to be aware of is that icu4c's regular expression
> implementation isn't 100% compatible with the RI. most obviously,
> there's no support for CANON_EQ, but there are a variety of more
> subtle incompatibilities too.
>
> --
> Elliott Hughes - http://who/enh - http://jessies.org/~enh/
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message