Return-Path: Delivered-To: apmail-harmony-dev-archive@www.apache.org Received: (qmail 45324 invoked from network); 10 Nov 2008 02:20:26 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 10 Nov 2008 02:20:26 -0000 Received: (qmail 90227 invoked by uid 500); 10 Nov 2008 02:20:32 -0000 Delivered-To: apmail-harmony-dev-archive@harmony.apache.org Received: (qmail 90194 invoked by uid 500); 10 Nov 2008 02:20:32 -0000 Mailing-List: contact dev-help@harmony.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@harmony.apache.org Delivered-To: mailing list dev@harmony.apache.org Received: (qmail 90183 invoked by uid 99); 10 Nov 2008 02:20:31 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 09 Nov 2008 18:20:31 -0800 X-ASF-Spam-Status: No, hits=2.0 required=10.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of xiaoming.gu@gmail.com designates 72.14.220.159 as permitted sender) Received: from [72.14.220.159] (HELO fg-out-1718.google.com) (72.14.220.159) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 10 Nov 2008 02:19:14 +0000 Received: by fg-out-1718.google.com with SMTP id d23so2065379fga.36 for ; Sun, 09 Nov 2008 18:19:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:in-reply-to:mime-version:content-type:references; bh=z7GmxvbBtRd64orSVPMw938YQIGxjIOiq0auhL+TBZY=; b=UETvUJTqEegNTk0qrwxAouhE30ucx3KxKdab0TBoSQuuY4IGreWGs+zxyW/iiA+nnK NtyZ36no24TBhpIPLJvj3sqf8Nk+sX8hxJUK4+j4OLV7jHiKe40XyhoXLw8a4GxdwfX2 OnuMiL5F4DAawvyS1+U9CoXNVLeeNghfRJ5wY= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:in-reply-to:mime-version :content-type:references; b=uH7NnCTWnFv3HdoFq6CW5YpfbMEjnPFCAFF/6N4DsZ9//M2M80q88RRKqtAoQCWCq6 2SF1PukPy3aSHV7fhpFscxmWNgTWU5DG9O4c+f5hGivlb6xTLfqjwnZUeuAMWcg3XZ0e xDVSc5xGTvbcuBkrBfg1rGpUAgLIfswgFUZVU= Received: by 10.181.158.16 with SMTP id k16mr1920815bko.208.1226283595852; Sun, 09 Nov 2008 18:19:55 -0800 (PST) Received: by 10.180.255.3 with HTTP; Sun, 9 Nov 2008 18:19:55 -0800 (PST) Message-ID: <255079590811091819v3fbdb423xf0ad8999db437f3b@mail.gmail.com> Date: Mon, 10 Nov 2008 10:19:55 +0800 From: "xiaoming gu" To: dev@harmony.apache.org Subject: Re: discussion for H5826 In-Reply-To: <255079590810150220k16f16548s2661e8df18a3c9a2@mail.gmail.com> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_64192_17119982.1226283595851" References: <255079590810102144v3e22b2a6ie45c26134a998ab0@mail.gmail.com> <9623c9a50810121823g3b7a3807s164b9655d949b223@mail.gmail.com> <255079590810150220k16f16548s2661e8df18a3c9a2@mail.gmail.com> X-Virus-Checked: Checked by ClamAV on apache.org ------=_Part_64192_17119982.1226283595851 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline Hi, guys. I'm debugging a basic version of global propagation in LIR for IA32. Now a simple test case passed but I encounter a problem when running SPECjvm2008. If we do not use early_prop in server_static or opt (by editing emconf files), then some error happens when running startup.helloworld. I counldn't continue the debugging for global propagation since I don't know whether the bug comes from it. In my opinion, the optimizations should be as independent as possible. Early_prop only does some propagations. I have no idea why it could not be bypassed. I'm going to work on this problem. Any comment or idea? Thanks. Xiaoming On Wed, Oct 15, 2008 at 5:20 PM, xiaoming gu wrote: > Hi, guys. I made some progress today. Now we could get partly benefits by > improving peephole in LIR. > > In current peephole, there is no actual optimization for "dst1, dst2 = MUL > src1, src2(0)". After add this, one 32-bit MUL is eliminated and we got > almost half part benefits (from about 200 to about 222). You may see > optimization process with the following LIR. > > =======before early_prop======= > BB_19 > PersistentId = 20 > ExecCnt = 9999.98 > Loop: Depth=0, !hdr, hdr=NULL > Predcessors: BB_18 > Successors: BB_33 [Prob=1] > I41: o87:U_32 =MOV t19:I_32 > I60: (AD:o89:U_32) =CopyPseudoInst/MOV (AU:o87:U_32) > I61: (AD:o90:I_32) =CopyPseudoInst/MOV (AU:o103(0):I_32) > I45: o93:U_32 =MOV t35:I_32 > I62: (AD:o95:U_32) =CopyPseudoInst/MOV (AU:o93:U_32) > I63: (AD:o96:I_32) =CopyPseudoInst/MOV (AU:o104(0):I_32) > > > BB_33 > PersistentId = 0 > ExecCnt = 0 > Loop: Depth=0, !hdr, hdr=NULL > Predcessors: BB_19 > Successors: BB_34 [Prob=1] > I49: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o89:U_32) > I50: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o96:I_32 > I51: (AD:o101:I_32) =CopyPseudoInst/MOV (AU:o99:I_32) > I52: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o90:I_32) > I53: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o95:U_32 > I54: o101:I_32 (ID:v8(EFLGS):U_32) =ADD o101:I_32,o99:I_32 > I55: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o89:U_32) > I56: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o95:U_32 > I57: o100:I_32 (ID:v8(EFLGS):U_32) =ADD o100:I_32,o101:I_32 > > =======after early_prop======= > BB_33 > PersistentId = 0 > ExecCnt = 0 > Loop: Depth=0, !hdr, hdr=NULL > Predcessors: BB_19 > Successors: BB_34 [Prob=1] > I49: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32) > I50: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o104(0):I_32 > I51: (AD:o101:I_32) =CopyPseudoInst/MOV (AU:o99:I_32) > I52: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o103(0):I_32) > I53: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,t35:I_32 //o99 > is NOT propagate from o103(0) > I54: o101:I_32 (ID:v8(EFLGS):U_32) =ADD o101:I_32,o99:I_32 > I55: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32) > I56: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,t35:I_32 > I57: o100:I_32 (ID:v8(EFLGS):U_32) =ADD o100:I_32,o101:I_32 > > =======before 1st peephole======= > BB_33 > PersistentId = 0 > ExecCnt = 0 > Loop: Depth=0, !hdr, hdr=NULL > Predcessors: BB_19 > Successors: BB_34 [Prob=1] > I49: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32) > I50: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o104(0):I_32 > I51: (AD:o101:I_32) =CopyPseudoInst/MOV (AU:o99:I_32) > I52: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o103(0):I_32) > I53: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,t35:I_32 > I54: o101:I_32 (ID:v8(EFLGS):U_32) =ADD o101:I_32,o99:I_32 > I55: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32) > I56: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,t35:I_32 > I57: o100:I_32 (ID:v8(EFLGS):U_32) =ADD o100:I_32,o101:I_32 > > =======after 1st peephole======= > BB_33 > PersistentId = 0 > ExecCnt = 0 > Loop: Depth=0, !hdr, hdr=NULL > Predcessors: BB_19 > Successors: BB_34 [Prob=1] > I49: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32) > I65: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32) > I64: (AD:s100:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32) > I51: (AD:s101:I_32) =CopyPseudoInst/MOV (AU:s99:I_32) > I52: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t103(0):I_32) > I53: s100:I_32,s99:I_32 (ID:v8(EFLGS):U_32) =MUL s99:I_32,t35:I_32 > I54: s101:I_32 (ID:v8(EFLGS):U_32) =ADD s101:I_32,s99:I_32 > I55: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32) > I56: s100:I_32,s99:I_32 (ID:v8(EFLGS):U_32) =MUL s99:I_32,t35:I_32 > I57: s100:I_32 (ID:v8(EFLGS):U_32) =ADD s100:I_32,s101:I_32 > > =======before 2nd cg_dce======= > BB_33 > PersistentId = 0 > ExecCnt = 0 > Loop: Depth=0, !hdr, hdr=NULL > Predcessors: BB_19 > Successors: BB_34 [Prob=1] > I49: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32) > I65: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32) > I64: (AD:s100:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32) > I51: (AD:s101:I_32) =CopyPseudoInst/MOV (AU:s99:I_32) > I52: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t103(0):I_32) > I53: (ID:v8(EFLGS):U_32) =MUL s100:I_32,s99:I_32,t35:I_32 > I54: (ID:v8(EFLGS):U_32) =ADD s101:I_32,s99:I_32 > I55: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32) > I56: (ID:v8(EFLGS):U_32) =MUL s100:I_32,s99:I_32,t35:I_32 > I57: (ID:v8(EFLGS):U_32) =ADD s100:I_32,s101:I_32 > > =======after 2nd cg_dce======= > BB_33 > PersistentId = 0 > ExecCnt = 0 > Loop: Depth=0, !hdr, hdr=NULL > Predcessors: BB_18 > Successors: BB_30 [Prob=1] > I65: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32) // 0 should > be propagated to s101 in I54 > I51: (AD:s101:I_32) =CopyPseudoInst/MOV (AU:s99:I_32) // then I65 > and I51 could be eliminated > I52: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t103(0):I_32) > I53: (ID:v8(EFLGS):U_32) =MUL s100:I_32,s99:I_32,v35:I_32 > I54: (ID:v8(EFLGS):U_32) =ADD s101:I_32,s99:I_32 > I55: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:v19:I_32) > I56: (ID:v8(EFLGS):U_32) =MUL s100:I_32,s99:I_32,v35:I_32 > I57: (ID:v8(EFLGS):U_32) =ADD s100:I_32,s101:I_32 > > =======finally======= > BB_33 > PersistentId = 0 > ExecCnt = 9999.99 > Loop: Depth=0, !hdr, hdr=NULL > Predcessors: BB_18 > Successors: BB_30 [Prob=1] > Layout Succ: BB_30 > Block code address: 238B0128 > 238B0128 I188: (ID:s8(EFLGS):U_32) =XOR t99(EDI):I_32,t99(EDI):I_32 > DEADBEEF I51: (AD:s101(EDI):I_32) =CopyPseudoInst/MOV > (AU:t99(EDI):I_32) > 238B012A I187: (ID:s8(EFLGS):U_32) =XOR s130(EAX):I_32,s130(EAX):I_32 > 238B012C I53: (ID:s8(EFLGS):U_32) =MUL > t100(EDX):I_32,s130(EAX):I_32,t35(EBP):I_32 > 238B012E I54: (ID:s8(EFLGS):U_32) =ADD s101(EDI):I_32,s130(EAX):I_32 > 238B0130 I186: MOV s153(EAX):I_32,t19(EBX):I_32 > 238B0132 I56: (ID:s8(EFLGS):U_32) =MUL > s154(EDX):I_32,s153(EAX):I_32,t35(EBP):I_32 > 238B0134 I57: (ID:s8(EFLGS):U_32) =ADD s154(EDX):I_32,s101(EDI):I_32 > > You may see this optimization is related to early_prop (propagate 0 to > MUL), peephole (change MUL with 0 to MOV 0) and cg_dce (eliminate unused MOV > 0 or other LIR). Now we have problems with early_prop. For early_prop the > problem is that it only do propagation for the operands with unique > definition. That's why 0 is not propagated to the 2nd unnecessary MUL (even > local propagation could do). We need to use better data-flow analysis to do > the propagation. For the left unnecessary instructions after cg_dce, I'm > going to add another early_prop pass before it. Hope it works. > > Xiaoming > > > On Mon, Oct 13, 2008 at 9:23 AM, Xiao-Feng Li wrote: > >> On Sat, Oct 11, 2008 at 12:44 PM, xiaoming gu >> wrote: >> > Hi, all. I find the benefits of this patch come from changing a 64-bit >> MUL >> > (3 32-bit MUL and 2 32-bit ADD) to a 32-bit MUL. At present, the >> > optimization is done as a magic replacement, which is not a common way >> to >> > generate code. (https://issues.apache.org/jira/browse/HARMONY-5826) >> >> Thanks for identifying the root cause of improvement. >> >> > Assume A and B are both 64-bit operands and we are doing A*B. In 32-bit >> > machine, the MUL operation is usually translated to (High 32-bit of >> A)*(Low >> > 32-bit of b)+(Low 32-bit of A)*(High 32-bit of B)+(Low 32-bit of A)*(Low >> > 32-bit of A). But when we know High 32-bit of A and B are both 0, only >> (Low >> > 32-bit of A)*(Low 32-bit of A) needed. >> > >> > Following are the HIR and LIR for result = (a & ffffffffL) * (b & >> ffffffffL) >> > + (c & ffffffffL) + (d & ffffffffL) without this patch. We can do the >> > optimization in HIR simplifier or LIR peephole. I'm not sure whether >> > changing int64 operation to int32 operation will bring overhead for >> 64-bit >> > machine. >> >> This needs performance evaluation to answer. This patch looks like >> specific to 32-bit machine, since original sequence of 3 MULs were for >> 32-bit machine. >> >> > I think maybe peephole is a better place. I find there is no >> > peephole optimization for XOR. If JIT could find out the result of XOR >> is 0, >> > then propagates the 0 to MUL and related MUL is eliminated. The left >> problem >> > is whether there is sufficient data-flow analysis in LIR to do the >> > propagation and elimination. >> >> I guess the best place choice depends on the semantics of the >> optimization. Since this is for instruction sequence optimization >> rather than operation logic optimization, I agree it is better to be >> done in LIR peephole. >> >> Thanks, >> xiaofeng >> >> > >> > =====HIR===== >> > I42:ldci8 #4294967295 -) t38:int64 >> > I43:and t37, t38 -) t39:int64 >> > I44:convi8 g23 -) t40:int64 >> > I45:and t40, t38 -) t41:int64 >> > I46:mul t39, t41 -) t42:int64 >> > =====LIR===== >> > 238B02A6 I329: MOV s286[v208(ESP)+t285(-100)]:I_32,v19(EAX):I_32 >> > 238B02AA I328: MOV t291[v208(ESP)+t290(-104)]:I_32,v19(EAX):I_32 >> > 238B02AE I327: (ID:s8(EFLGS):U_32) =XOR t206(EDX):I_32,t206(EDX):I_32 >> > 238B02B0 I326: MOV s292(EAX):I_32,s286[v208(ESP)+t285(-100)]:I_32 >> > 238B02B4 I70: (ID:s8(EFLGS):U_32) =MUL >> > s139(EDX):I_32,s292(EAX):I_32,t206(EDX):I_32 >> > 238B02B6 I325: MOV s140(EBX):I_32,s292(EAX):I_32 >> > 238B02B8 I324: (ID:s8(EFLGS):U_32) =XOR s292(EAX):I_32,s292(EAX):I_32 >> > 238B02BA I323: MOV t289(EDI):I_32,v217[v208(ESP)+t216(-24)]:I_32 >> > 238B02BE I73: (ID:s8(EFLGS):U_32) =MUL >> > s139(EDX):I_32,s292(EAX):I_32,t289(EDI):I_32 >> > 238B02C0 I74: (ID:s8(EFLGS):U_32) =ADD s140(EBX):I_32,s292(EAX):I_32 >> > 238B02C2 I322: MOV v19(EAX):I_32,t291[v208(ESP)+t290(-104)]:I_32 >> > DEADBEEF I75: (AD:s293(EAX):I_32) =CopyPseudoInst/MOV >> (AU:v19(EAX):I_32) >> > >> > 238B02C6 I76: (ID:s8(EFLGS):U_32) =MUL >> > s139(EDX):I_32,s293(EAX):I_32,t289(EDI):I_32 >> > 238B02C8 I321: MOV s286[v208(ESP)+t285(-100)]:I_32,s293(EAX):I_32 >> > 238B02CC I77: (ID:s8(EFLGS):U_32) =ADD s139(EDX):I_32,s140(EBX):I_32 >> > >> > Any comments? Thanks. >> > >> > Xiaoming >> > >> >> >> >> -- >> http://xiao-feng.blogspot.com >> > > ------=_Part_64192_17119982.1226283595851--