On Sep 19, 2008 11:05pm, xiaoming gu wrote:
>
>
> On Fri, Sep 19, 2008 at 8:24 PM, Ian Rogers rogers.email@gmail.com> wrote:
>
> xiaoming gu wrote:
>
>
> Hi, all. I did something more for shladd=>LEA today. With the available
MUL
>
> strength reduction,
>
> X*10 is reduced to (X
> instruction (CASE 3).
>
> Actually this XOR is not necessay and could be eliminated in HIR2LIR pass.
>
> Following is the
>
> better instructions generated with the improve patch. Comparing with
>
> previous CASE 3, you may
>
> find XOR gone.
>
>
>
>
>
> CASE 4: MUL strength reduction - using LEA and taking care of 0
>
>
>
> I22: LEA t48(EDI):I_32,t47[v434(EBP)+v434(EBP)*t46(4)]:I_32 bcOff: 42 \l\
>
> I23: LEA t52(EDI):I_32,t51[t48(EDI)*t50(2)+t49(0)]:I_32 bcOff: 42 \l\
>
> I861: MOV v533[v521(ESP)+t532(-24)]:I_32,t52(EDI):I_32 bcOff: 43 \l\
>
> I860: MOV v535[v521(ESP)+t534(-28)]:I_32,t53(1):I_32 bcOff: 45 \l\
>
> I26: EmptyPseudoInst bcOff: 48 \l\
>
>
>
> CASE1 CASE2 CASE3 CASE4
>
> Time (msec) 6234 7688 5734 5704
>
> Normalized 1 1.233 0.920 0.915
>
>
>
>
>
> I'm going to submit the patch though it only brings small performance
>
> improvement (0.5%). Any
>
> comment is welcome. Thanks.
>
>
>
> Xiaoming
>
>
>
>
>
> Hi Xiaoming,
>
>
>
> inspired by all this discussion I rehashed some of this code in Jikes
RVM, the results are here [1]. We have a BURS instruction selector so we
don't generate ShiftAdd operations instead the tree pattern matcher will
combine a shift and add into an address [2] that we then turn into an LEA
[3]. Anyway for a multiply by 10 we now produce:
>
>
>
>
> 3 int_move t3si(I) = 0
>
> 3 int_shl t4si(I) = l0si(I,d), 1
>
> 3 int_add t5si(I) = t3si(I), t4si(I)
>
> 3 int_shl t6si(I) = l0si(I,d), 3
>
> 3 int_add t7si(I) = t5si(I), t6si(I)
>
> 3 int_move t2si(I) = t7si(I)
>
>
>
> that becomes:
>
>
>
> 3 ia32_lea t6si(I) = DW
>
> 3 ia32_lea t7si(I) = DW
>
>
>
> Rather than (x
>
>
> 3 ia32_lea EDX(I) = DW
>
> 3 ia32_lea EDX(I) = DW
>
> 3 ia32_shl EAX(I) AF CF OF PF SF ZF
> 3 ia32_add EAX(I) AF CF OF PF SF ZF
>
>
> that is ((x*4)*(1+8))+x*64. If we generated the shift and adds for the
*64 from the preceding shift then we end up losing the add in the 2nd LEA
and having to generate an extra add to compensate, ie:
>
>
>
> 3 ia32_lea t1 = l0*4
>
> 3 ia32_lea t2 = t1*8
>
> 3 ia32_lea t3 = t2*2 + t2
>
> 3 ia32_add t1 = t3 + t1
>
>
>
> I believe we might as well do the shift by 6 and add as at least that can
be done in parallel to the LEAs. I mention this as I think this may be what
your multiply by 100 should look like. It'd be nice if Jikes RVM were doing
simplification of constant division and of course for multiplication we're
not considering subtractions to create the desired result (help welcome :-)
).
>
>
>
>
> Regards,
>
> Ian
>
>
>
> [1]
http://jikesrvm.svn.sourceforge.net/viewvc/jikesrvm/rvmroot/trunk/rvm/src/org/jikesrvm/compilers/opt/Simplifier.java?revision=15001&view=markup#l_1379
>
>
> [2]
http://jikesrvm.svn.sourceforge.net/viewvc/jikesrvm/rvmroot/trunk/rvm/src-generated/opt-burs/ia32/IA32.rules?view=markup#l_229
>
>
> [3]
http://jikesrvm.svn.sourceforge.net/viewvc/jikesrvm/rvmroot/trunk/rvm/src-generated/opt-burs/ia32/IA32.rules?view=markup#l_297
>
> Hi, Ian. Thanks for your helpful information.
>
> X*100 is reduced to ((X2+0) and ((X
>
> There are SUB and NEG operations in the reduction and I'm going to study
the details next week. You may refer to multiplybyconstant.cpp for MUL
reduction and DIV reduction is in simplifyTauDiv() and simplifyTauRem() in
simplifier.cpp. Hope they are helpful. Thanks.
>
>
> Xiaoming
>
>
Ian, a quick question - is there any instruction scheduling work in Jikes
RVM? If yes, local or global or both? Thanks. -Xiaoming