From dev-return-39515-apmail-harmony-dev-archive=harmony.apache.org@harmony.apache.org Thu Jul 08 03:21:29 2010 Return-Path: Delivered-To: apmail-harmony-dev-archive@www.apache.org Received: (qmail 57888 invoked from network); 8 Jul 2010 03:21:29 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 8 Jul 2010 03:21:29 -0000 Received: (qmail 57183 invoked by uid 500); 8 Jul 2010 03:21:28 -0000 Delivered-To: apmail-harmony-dev-archive@harmony.apache.org Received: (qmail 56973 invoked by uid 500); 8 Jul 2010 03:21:25 -0000 Mailing-List: contact dev-help@harmony.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@harmony.apache.org Delivered-To: mailing list dev@harmony.apache.org Received: (qmail 56964 invoked by uid 99); 8 Jul 2010 03:21:24 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 08 Jul 2010 03:21:24 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=FREEMAIL_FROM,HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of devyoudw@gmail.com designates 74.125.82.43 as permitted sender) Received: from [74.125.82.43] (HELO mail-ww0-f43.google.com) (74.125.82.43) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 08 Jul 2010 03:21:17 +0000 Received: by wwa36 with SMTP id 36so1373520wwa.0 for ; Wed, 07 Jul 2010 20:19:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:received:in-reply-to :references:date:message-id:subject:from:to:content-type; bh=1aGJx2kSXrMu2QKp8nq+VMSc8D+/9lnGMZL3iWz4Sf4=; b=oly3Qx8xcVybBV6c3SbTew1c28Mfh+44q6jg08ADOF4fH7pRjp8iqXBSv6D2PUvn+b OkFsgV2qPcrnSzg1RbGkArSeVt/DMp3hCG6JzaqyMmSJ+i+JK3gN92pKD6KdBWeX12Mi gsKGqNP19ihHuPA3sxFGH8ofxqm9+QtsLyMqs= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=Hq+mv2Mi7fzVCnkHykj8aTIZCltDJ704U0WJS6lYQVB7end4Z6CQIN/kfsJScZRTek +Y1/4+aqTwDnlvM/a5VGa0pgbsohc+XjWCS+3IZcyY5B49kz9VvMbAzaLKg9gqPieXX1 ksIwly3kuDGWAn+mE7vxFw4dQ/a9pakcLbFc8= MIME-Version: 1.0 Received: by 10.227.27.75 with SMTP id h11mr376906wbc.90.1278559196927; Wed, 07 Jul 2010 20:19:56 -0700 (PDT) Received: by 10.216.90.207 with HTTP; Wed, 7 Jul 2010 20:19:56 -0700 (PDT) In-Reply-To: References: Date: Thu, 8 Jul 2010 11:19:56 +0800 Message-ID: Subject: Fwd: A question about android regex implementation From: Deven You To: dev Content-Type: multipart/alternative; boundary=002215975b5ec1d90d048ad7c290 X-Virus-Checked: Checked by ClamAV on apache.org --002215975b5ec1d90d048ad7c290 Content-Type: text/plain; charset=ISO-8859-1 ---------- Forwarded message ---------- From: Deven You Date: 2010/7/8 Subject: Re: A question about android regex implementation To: enh 2010/7/8 enh On Wed, Jul 7, 2010 at 09:55, Jesse Wilson wrote: > > On Mon, Jul 5, 2010 at 8:02 PM, Deven You wrote: > > > I have written some simple benchmarks for harmony regex and find the > > > performance of harmony is poor compared to RI. For example, > Mathcer.find() > > > only reach 60% of that of RI. > > i'd be interested to see your benchmark code. our regular expression > benchmarks actually focus on replacing use of Pattern/Matcher with > simpler fast paths where the full complexity isn't needed. for > example, most String.split calls don't actually use regular > expressions. i'm not sure how much of this work made it into froyo, > and it's certainly still a work in progress. > I first found harmony regex is slower than RI is from an internal benchmark test result. In this benchmark, Mather.find() is relatively hot and slower than RI. Then I write a simple test to verify it: import java.util.regex.Matcher; import java.util.regex.Pattern; public class MatcherFind { public static void main(String[] args) { Pattern pat = Pattern.compile("(\\d{1,3})"); Matcher matcher = pat.matcher("aaaa123456789045"); //warm up for (int i = 0; i <1000000; i++) { while(matcher.find()) { } matcher.reset(); } long time = System.currentTimeMillis(); for (int i = 0; i <10000000; i++) { while(matcher.find()) { } matcher.reset(); } time = System.currentTimeMillis() - time; System.out.println("total time of MatcherFind is " + time + " ms!"); } The results are harmony takes 9841 -10079 ms meanwhile RI takes 5596 - 5877 ms. In volano benchmark, Matcher(Pattern, CharSequence) and Mather.group(int) are also slower than RI, although they are not hot methods. My simple benchmarks for these 2 methods also approve this. 1. Matcher(Pattern, CharSequence) import java.util.regex.Matcher; import java.util.regex.Pattern; public class MatcherConstruct { public static void main(String[] args) { Pattern pat = Pattern.compile("XX"); Matcher matcher = null; //warmup phase for (int i = 0; i < 100000; i++) { matcher = pat.matcher("Today is XX-XX-XX ..."); } long time = System.currentTimeMillis(); for (int i = 0; i < 1000000; i++) { matcher = pat.matcher("Today is XX-XX-XX ..."); } time = System.currentTimeMillis() - time; System.out.println("Total time of MatcherConstruct is " + time + " ms!"); } } 2. Matcher.groun(int) import java.util.regex.Matcher; import java.util.regex.Pattern; public class MatcherGroupInt { public static void main(String[] args) { String testPattern = "(\\d{1,3})"; String testString = "aaaa123456789045"; Pattern pat = Pattern.compile(testPattern); Matcher matcher = pat.matcher(testString); //warm up for (int i = 0; i < 500000; i++) { while (matcher.find()) { String result = matcher.group(1); } matcher.reset(); } long time = System.currentTimeMillis(); for (int i = 0; i < 1000000; i++) { while (matcher.find()) { String result = matcher.group(1); } matcher.reset(); } time = System.currentTimeMillis() - time; System.out.println("total time of MatcherGroupInt is " + time + " ms!"); } } > one thing we want to do is use http://code.google.com/p/caliper/ > benchmarks to track the relative performance of Java implementations > of the stuff that's currently native code (but doesn't inherently need > to be). we don't have anything like that for regular expressions yet. > (our tool for conveniently running benchmarks is > http://code.google.com/p/vogar/ and you can find some of our actual > benchmarks at http://code.google.com/p/dalvik/.) > > > > I heard Android use icu4jni re-implement this > > > module. Since icu4jni use native code I think it may has higher > performance > > > than harmony. I am trying to use icu4jni as the back-end of harmony > regex > > > but find icu4jni has no functions related to regex operations. > > although, as jesse said, regular expressions are a special case where > there was no original icu4jni implementation, icu4jni is a dead > project (http://site.icu-project.org/#TOC-ICU4JNI). Android used the > icu4jni code, but it's been gradually rewritten since then. the > process is incomplete, and the extent of the rewrite varies from class > to class, but you're probably better off taking code from Android > rather than icu4jni. > > > > I know there are some android guys in our community. So can anyone tell > me > > > some detail info for android's regex, like if it re-implement the regex > > > logic using native code by android itself rather than icu4jni and > really get > > > higher performance compared to harmony regex? Thanks a lot! > > > > We actually use icu4c's pattern, which we access via JNI. Documentation > is > > here: > > http://icu-project.org/apiref/icu4c/classRegexPattern.html > > > > Our regex-java integration code is Apache-licensed and available in our > Git > > repository here: > > > > > http://android.git.kernel.org/?p=platform/libcore.git;a=tree;f=regex/src/main/java/java/util/regex;h=f9207d043a24d3dc034bf2cc3a5fdd57115692c3;hb=master > > actually, in shipping versions of Android we use the C API rather than > the C++ API. see > > http://android.git.kernel.org/?p=platform/libcore.git;a=blob;f=icu/src/main/native/NativeRegEx.cpp;h=7b3cafc0779ef7f3ed69a32128af7f96e3498922;hb=master > for the gory details. > > i've recently rewritten this to use the C++ APIs, and it's much > cleaner for it. if you look at the current code you'll see that we > copy the input to the native heap, because icu4c < 4.6 can't cope with > its input moving about between calls (and a VM can't necessarily > guarantee the Java char[] doesn't move around). > > in future, we'll use a new icu4c API to get the best of both worlds: > > > http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTim7czTaFNEBQm8tO3LhdDingvS3pshpW79XBIhu@mail.gmail.com&forum_name=icu-design > > if you're shipping your own copy of icu (rather than using some system > one you have no control over), you might want to backport that change > (which has been submitted). > > one final thing to be aware of is that icu4c's regular expression > implementation isn't 100% compatible with the RI. most obviously, > there's no support for CANON_EQ, but there are a variety of more > subtle incompatibilities too. > > -- > Elliott Hughes - http://who/enh - http://jessies.org/~enh/ > --002215975b5ec1d90d048ad7c290--