harmony-dev mailing list archives

From Alex Astapchuk <alex.astapc...@gmail.com>
Subject Re: [drlvm][jit][opt] HARMONY-3243 It's possible to fill array with constant much faster
Date Tue, 20 Mar 2007 15:40:16 GMT
Hi Egor,

Thank you for your reply. Please find my answers inlined.

Egor Pasko wrote:
> On the 0x29E day of Apache Harmony Alex Astapchuk wrote:
>> Hi Egor,
>>
>> Egor Pasko wrote:
>>> On the 0x29B day of Apache Harmony Alex Astapchuk wrote:
>>>> Hi Egor,
>>>>
>>>> Thank you for you input. Please, find the answers inlined.
>>>>
>>>> Egor Pasko wrote:
>>>>> On the 0x29B day of Apache Harmony Alex Astapchuk wrote:
>>>>>> Hi Egor,
>>>>>>
>>>>>> Thank you for your valuable comments. Please, find the answers inlined.
>>>>>>
>>>>>> Egor Pasko wrote:
>>>>>>> JIT guys,
>>>>>>> ...on the HARMONY-3243 patch I would like an open discussion in dev@
>>>>>>> rather than a limited one in JIRA.
>>>>>>> first of all, the latest faf_em64t.patch crashes for me with the
>>>>>>> message:
>>>>>>> java:
>>>>>>> /home/xxx/svn/1/trunk/working_vm/vm/jitrino/src/shared/LoopTree.h:85:
>>>>>>> bool Jitrino::LoopTree::hasLoops() const: Assertion `isValid()'
>>>>>>> failed.
>>>>>>> SIGABRT in VM code.
>>>>>>> Stack trace:
>>>>>>> addr2line: '[vdso]': No such file
>>>>>>> (hangs here)
>>>>>> Could you please add the steps to reproduce into the JIRA?
>>>>> done (maybe, my version is overly old, I'll check)
>>>> Thanks a lot!
>>>>
>>>>>>> next, with all great respect to Nikolay Sidelnikov for his efforts I
>>>>>>> think, this idea needs a more careful analysis and probably
>>>>>>> reimplementation.
>>>>>>> The story is: 2 new optimization passes introduced: one in High-level
>>>>>>> Optimizer and one in Code Generator:
>>>>>>> pass.1 recognizes a simple loop initialization with a simple pattern,
>>>>>>>        substitutes it with a JIT helper call
>>>>>>> pass.2 takes the corresponding call and substitutes it with a
>>>>>>>        Low-Level-IR as it thinks the best code is
>>>>>>> this should hypothetically improve one simple code pattern (that is
>>>>>>> probably widely used?), i.e.: for (int i=0;i<I;i++){A[i]=X;}
>>>>>>> What I figured out looking at the patch:
>>>>>>> * [pass.2] does not seem to throw any AIOutOfBoundsException
>>>>>> It moves bound checks outside of the loop and does the 'fast' loop
>>>>>> only when the check guarantees that no outofbounds happens. Otherwise,
>>>>>> it falls into a 'regular' version.
>>>>> OK, but I still cannot find the generation of code for the regular
>>>>> version in CG. Maybe, I am overlooking something. But, anyway,
>>>> Sure.
>>>> You can't find it in CG. It's in HLO.
>>> oops, of course:) thanks. If the patch worked I would have figured
>>> it out.
>> Btw, in your stack trace provided, the revision number is 515555 which
>> is pretty old.
>> As far as I know, it's not reproducible on recent builds. Anyway, thanks
>> for pointing out the issue.
> 
> OK
> 
>>>> [If I'm not mistaken] from what I see the improved variant gets added
>>>> after the bounds checks. Again - in HLO pass, not in CG.
>>> yep
>>>
>>>>> supplemented tests with various conditions of bounds checking are
>>>>> absolutely necessary for this patch.
>>>>>
>>>>>>> * [pass.2] does not have any useful tuning parameters such as number
>>>>>>>            of unrolls per loop, thus the scheme eats potential benefit
>>>>>>>            from loop unrolling and does not give any tuning back
>>>>>> The optimization is mostly orthogonal to the loop unrolling - see
>>>>>> also below.
>>>>> by saying "orthogonal" you mean, loop unrolling together with this
>>>>> arrayFilling can give more on a certain code sample than each of the
>>>> No.
>>>> What I meant is few lines below in the same message.
>>>>
>>>> It's about more effective instructions and BBP overkill.
>>>>
>>>>> optimizations alone? it is not true. these do not complement each
>>>>> other as far as I can see. But in case of abcd+versioning+unrolling+simple_fill
>>>>> they could.
>>>>> for simple_fill idea see below
>>>>>
>>>>>>> * [pass.1] detects such a rare pattern that I doubt it would benefit a
>>>>>>>            user (but obviously will benefit a you-know-which-benchmark
>>>>>>>            runner)
>>>>>> It's quite common for many string intensive applications, including
>>>>>> XML-oriented.
>>>>>> I would suppose it's even more common than 'static array initialization'
>>>>>> pattern that is explicitly handled in the translator:
>>>>>> you may have a look into
>>>>>> jitrino/src/translator/java/JavaByteCodeTranslator.cpp,
>>>>>> methods JavaByteCodeTranslator::newarray(), checkForArrayInitializer().
>>>>> maybe..
>>>>> to prove that, do you have a handy popular open source benchmark to
>>>>> show this extreme space filling in XML-intensive computations? are
>>>>> there many XML documents of 1000 spaces in a row that make sense?
>>>> Please, see below.
>>>>
>>>>>>> * [pass.1] has a lot of new code that introduces potential instability
>>>>>> Could you please clarify more about instability - what do you mean exactly?
>>>>> I mean, it needs more testing. More corner cases with AIOOBE at
>>>>> least.
>>>>>
>>>>>>>            (if the pattern was detected not properly, the code does
>>>>>>>            not read easily), but does not contain a single unit test
>>>>>>>            or the like. Together with AIOOBE issue stability becomes a
>>>>>>>            real question.
>>>>>>> * back branch polling is not performed (which is probably good for
>>>>>>>   performance, but I would better have a tuning option)
>>>>>> BBP is useless (and even harmful) on short finite loops for various
>>>>>> reasons - it adds a memory access (a flag check), a branch, and a live
>>>>>> range. For hot short loops it's overkill.
>>>>> in tp.java they are not quite short. And XML-intensive applications
>>>> 'quite short' is 'quite subjective' :-)
>>>>
>>>> Please, have a deeper look into the provided micro benchmark.
>>>> It has nothing to do with 1000 spaces in a row.
>>>>
>>>> It works with *80 chars* length strings. I personally find it a good
>>>> average. If we find it productive, we may examine the same tests for
>>>> various lengths, but I bet it will work for all the real strings.
>>>>
>>>>> trigger GC often, and GC needs this BBP to suspend faster. And, it is
>>>>> not an obvious thing. I would prefer complementary optimizations that
>>>> Huh? Did I read it correctly?
>>>> Do you *really* mean that such short loops may affect a thread
>>>> suspension??
>>>>
>>>> I don't believe this is true.
>>>>
>>>> Such short loops take just dozens of clockticks to finish. While the
>>>> thread suspension involves much more heavy code sequence including spin
>>>> loops, waiting on events and context switches. They are just not
>>>> comparable in time.
>>>> Please correct me if my estimates are wrong.
>>> Alex, let's then put a limit of 80 (or
>>> MAX_ALLOWED_CHARS_FOR_ARRAY_FILL) chars to this optimization. There is
>>> no such limit currently.
>> Sorry, did not get it. For what reason?
> 
> fillArray may gently touch a long loop. This has a negative effect on
> suspension time.

'Has'?
Do you have any estimate of this 'negative effect', please?

Ok, back to reality :-)
Filling a string of 65K 16-bit chars (which is much longer than the real
strings) takes about 0.00001s even on a slow Pentium *M* / 1.7 GHz.
This is far from any noticeable 'effect'.
The time is roughly O(n) - filling a string of 65M chars
(which in fact is 128M bytes) takes 0.14s.
On powerful server-class boxes the numbers will be even smaller.

Taking even these numbers into account, the effect is near zero.
In an app juggling 128Mb arrays, the thread suspension time is far from
dominating or even significant. The app will spend much more time on
allocations, collections, accesses to the data, etc.
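For reference, timings like the above can be reproduced with a trivial
microbenchmark along these lines (a sketch only, not the benchmark
attached to HARMONY-3243; the class name, sizes, and warmup counts are
arbitrary choices of mine):

```java
// Hypothetical sketch: time a plain char-array fill of 64K chars.
// Not the HARMONY-3243 microbenchmark; sizes and warmup are arbitrary.
public class FillTiming {
    // The loop shape the optimization targets: a simple constant fill.
    static void fill(char[] a, char v) {
        for (int i = 0; i < a.length; i++) {
            a[i] = v;
        }
    }

    public static void main(String[] args) {
        char[] buf = new char[65536]; // 65K 16-bit chars = 128K bytes
        // Warm up so the JIT compiles fill() before the measured run.
        for (int i = 0; i < 10000; i++) {
            fill(buf, ' ');
        }
        long t0 = System.nanoTime();
        fill(buf, 'x');
        long t1 = System.nanoTime();
        System.out.println("filling 64K chars took " + (t1 - t0) + " ns");
    }
}
```

The absolute number will vary by machine and VM, of course, but it stays
in the microsecond range, which is the point being made above.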

>>>>> do not kill each other. And tuning parameters for a genetic algorithm
>>>>> to choose non-obvious heuristics.
>>>>>
>>>>>> Throwing away the BBP is one of the goals for the optimization...
>>>>> loop finiteness is easily detected in this case. And I would prefer
>>>>> more general algorithms that optimize a variety of situations. This
>>>>> specific optimization detects only one specific case, has a lot of code
>>>>> and is not properly tested. Are we still targeting stability?
>>>> Hmmmm.... Could you please be more specific? 'lot of code' is kind of
>>>> an emotional characteristic. :-)
>>>>
>>>> What makes you think it was not tested properly?
>>> no tests in JIRA
>> There are many other tests performed during a feature preparation.
>> These include 'build test', mini- and other benchmarks and other apps.
>>
>> Also, as you see, your quick-look presumption about a missed OutOfBounds
>> check was also incorrect - all the existing machinery is still in place.
>> But if you have any suggestion on better testing in general, then it's
>> always welcome.
> 
> I have a suggestion to reflect testing efforts in JIRA. Reasonable? I
> think, yes. Unit tests are ALWAYS welcome. Specific corner cases are
> always welcome to test. I doubt 'build test' and other benchmarks and
> other apps test fillArray in a good way. If somebody invented new
> microbenchmarks to test on, they should be in JIRA.
> 
> Makes sense?

Yes, all of this sounds reasonable.
I was just mistakenly dazzled by the categorical claim that the
optimization 'was not tested properly' - sorry about that.


>>>> Bugs happen sometimes, it's sad. :-(
>>> I probably know :)
>>>
>>>> It was really helpful you discovered the problem before it goes into
>>>> the main trunk.
>>> Alex, sorry if I was too emotional. My bad. I think, it is not a good
>>> place here for emotions like these. But let's stick to more
>>> engineering subjects.
>>>
>>>>>>> What I can say more is that a good "ABCD" optimization complemented
>>>>>> ... and ABCD does nothing with BBP, isn't it?
>>>>>>
>>>>>>> with "loop versioning" optimization will make a more readable, more
>>>>>>> stable code, AND will give a better performance gain (loop unrolling
>>>>>> Unfortunately no, it will not. :-(
>>>>>>
>>>>>> As you probably know ;-), char is 16 bits wide in Java.
>>>>> I know, probably :)
>>>>>
>>>>>> The code generated for char moves has to use 16 bit movement
>>>>>> instructions. These instructions include operand-size change prefix
>>>>>> (66h) that makes CPU decoder feel bad.
>>>>> yukes
>>>>> Is this on super-duper Intel Core Whatever machines with decoders so
>>>>> badly feeling or it is about those somewhere living 486 processors?
>>>> It's true for many modern processors. [I suppose] they have to
>>>> support such instructions for the tons of existing legacy code,
>>>> while providing much better alternatives for modern compilers like
>>>> ICL or Jitrino.
>>>> If you're really interested in details, please refer to
>>>> the "Intel(R) 64 and IA-32 Architectures Optimization Reference Manual".
>>>> The link is at [1], and you may be interested in #3.4.2.3
>>>> "Length-Changing Prefixes (LCP)".
>>> 1. interesting reading. Pipeline stalls due to instruction
>>>    decoding. (I will try to be less emotional than I am now) In my
>>>    humble opinion we are writing to memory here. That is the place to
>>>    stall. Instruction decoding is generally somewhat 10 times faster
>>>    (if not 100).
>> Ah, these humble opinions... They are so human. :-)
>> Modern out-of-order processors like Intel's Core 2 Duo, are really hard
>> to predict having only a common sense and a single topic from manual.
>> Science begins where measurement begins.
>> Vtune may become your best friend here.
> 
> modern processors are not easy to predict, I know. You may have been
> less verbose on that.
> 
>>> 2. did you measure the actual difference?
>> YES!
>> Reading the JIRA points out a few-times speedup for the micro benchmark.
>> As it's common for micro benchmarks, YMMV on real apps depending on
>> characteristics of workload.
> 
> To be specific. Did you measure long encoding against short encoding
> with all the rest unchanged? If yes and the difference was
> significant, then cool. That was the question.

Aha, I see now.

Unfortunately, what you're asking about is hard to measure directly.
What you call 'long encoding against short encoding' in fact means
replacing 16 bit operations with 32 bit ones (and I personally would
hesitate to compare these directly).

We did not start the implementation with just a hope 'it will help
because it's better'.
We did measurements first to find out the sources of inefficiency.

Among others, VTune showed significant stalls due to LCP in our code.
The reworked code shows zero LCP stalls and works much faster.
I would presume that most of the speedup comes from the better
instructions. Our experiments showed pretty low boost from BBP removal
itself (if I'm not mistaken, it was about 30%, but not 'times').
Unrolling games gave almost nothing.

Anyway, a simple switch from 16 bit operations to 32 bit ones gives
roughly a 2 times speedup on micro benchmarks (for instance, from 1.98s
to 1.00s on the same PentiumM as above), though again, I would not
compare these numbers directly.
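The '16 bit vs. 32 bit store' idea can be illustrated in plain Java with
a ByteBuffer (an illustration of the principle only; the actual patch
emits the wide stores at the LIR level in Jitrino's code generator, not
through NIO, and the class and method names here are made up):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

// Sketch: pack two identical 16-bit chars into one 32-bit word and
// emit half as many stores. This only illustrates the principle behind
// replacing 16 bit movement instructions with 32 bit ones.
public class WideFill {
    static CharBuffer fillWide(int nChars, char v) {
        ByteBuffer bb = ByteBuffer.allocate(nChars * 2);
        int packed = (v << 16) | v;          // two chars per 32-bit word
        for (int i = 0; i < nChars / 2; i++) {
            bb.putInt(packed);               // one 32-bit store covers two chars
        }
        bb.rewind();
        return bb.asCharBuffer();            // view the result as chars again
    }

    public static void main(String[] args) {
        CharBuffer cb = fillWide(65536, ' ');
        System.out.println(cb.get(0) == ' ' && cb.get(65535) == ' ');
    }
}
```

Since both halves of the packed word hold the same char, byte order does
not affect the result here.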


>>> 3. #3.4.2.4 says that decoded insts are used for small loops. Is the
>>>    optimized sequence that large? more than 18 instructions? Well, I
>>>    did not count, maybe, maybe. But for number-crunching loops, not
>>>    memory accesses.
>> I was told that optimizations work whether you believe in them or not. :-)
> 
> bad joke or route too long :)
> 
>>>>>> Whatever unrolling or versioning
>>>>>> would leave these heavy instructions on place.
>>>>> If so, write a 5 line optimization pass that replaces the operation
>>>>> encoding in the (unrolled!) loops. In this case the optimization will
>>>>> be a) more general/useful/complementing-each-other b) more readable.
>>>> It's an interesting idea. It would be really interesting to see such a
>>>> pass implemented and also measure it against HARMONY-3243.
>>> I would love to implement versioning. Maybe, not so fast though..
>> Hmmm... I somehow recall someone said 'piece of cake'?
>> For HLO-oriented guy like you? ;-)
> 
> I do not remember
> 
>> Anyway, please don't treat my joke wrong - feel free to submit it when
>> it's convenient for you.
> 
> I do not see the joke. Sorry.
> If you have suggestions, they are welcome.
> 
>>>>>> One of the goals was to throw the heavy instructions away and replace
>>>>>> them with more effective and fast ones. It's somehow hard to do in
>>>>>> codegen separately, but much easier (and cleaner) with a little help from
>>>>>> the HLO side - this is the rationale for the HLO pass.
>>>>> Okay, now I begin understanding the whole motivation. Thanks! Could
>>>>> be
>>>>> in JIRA, btw.
>>>>> And now you have to support the helper in all codegenerators. I am
>>>>> not
>>>> Not really. As you probably know ;-) the optimization selection scheme
>>>> in Jitrino is extremely flexible. If a codegen does not need a
>>>> particular optimization, it may be turned off as easy as typing a single
>>>> '-' in .emconf.
>>> Ah, of course. I am somewhat stupid today, and probably not just
>>> today :)
>>>
>>>>> so happy. My scheme is more simple and portable.
>>>> ... but it does not exist in the code ;-) [yet?] ...
>>> thanks for pointing this out :) I probably know.
>> My pleasure. That's what friends are for.
>>
>>
>>>>>>> is awake too). Setting aside the fact that the overall design will be
>>>>>>> more straightforward (having no interdependent passes, extra helpers, etc)
>>>>>>> So I vote for focusing on ABCD plus "loop versioning" and leaving
>>>>>>> specific benchmark-oriented tricks (complicating our design) alone.
>>>>>> Again, the optimization is orthogonal to ABCD and was never
>>>>>> positioned as a replacement for ABCD. The optimization's main targets
>>>>>> are string (and XML) intensive apps.
>>>>> thanks for explanation
>>>>>
>>>>>>> An experienced hacker would say that all compiler research is a bunch
>>>>>>> of hacks that influence each other in unexpected ways. Well, maybe,
>>>>>>> but I do not like this particular hack (with all respect to Nikolay
>>>>>>> for his efforts)
>>>>>> In no way is this a hack. If you see room for further
>>>>>> improvements here, patches are always welcome. :-)
>>>>> I would prefer if we spent time on more usable approaches as loop
>>>>> versioning that take almost the same time to implement, are more
>>>>> portable, more readable, and solve a wider variety of optimization
>>>>> opportunities.
>>>> ... Seriously, I don't see any reason why to oppose one optimization
>>>> against another.
>>> if it makes design more complicated? more bugs that happen? I would
>>> not say these words in general (for a production quality piece of code).
>>
>> Did you say 'design'? How come a single simple optimization pass
>> affects the whole of Jitrino's design? You must be kidding...
>>
>> Could you please be more specific on what you are talking about?
> 
> Careful reading can unveil it. I pointed out on:
> 
> 1. extra helper (one extra helper does not make a design very
>    complicated, but I would generally avoid doing unnecessary things)

Hmm... Just to make it clear: in no way is this, nor will it become, a
part of Jitrino's design.
It is the implementation of one particular optimization pass.

You can find the Jitrino *design* described here:
http://harmony.apache.org/subcomponents/drlvm/JIT.html

> 2. this optimization has non-obvious side effects on BBP and

Let me presume that by 'BBP' here you really mean 'thread suspension'.
Then it has near-zero effect - please see numbers above.


>    unrolling. (now we seem to have agreed that they complement each
>    other most of the time)
> 
> they affect the design. Just for reference. Not for endless discussion.
> 
>> All these 'more complicated', etc. are too loose. If you have any
>> particular suggestion it would be nice to discuss it, rather than
>> talking in abstract general wordings. Or it would be even better to
>> see a patch for what you think needs an update, and why.
> 
> suggestions on testing can be seen above
> 
>> And sorry - I did not get about the bugs.
>> Bugs can be easily fixed (if any) - what's the problem with it?
> 
> 1. sometimes not easily
> 2. our first priority is to become more stable

You care about stability a lot, which in itself is a good thing.

>>>> Surely, it would be just wonderful to have all optimizations at once.
>>>>
>>>> I really mean 'all' - 'every single possible optimization'.
>>>> It would be really great.
>>> yep
>>>
>>>> Though software development (as you probably know ;-) ) is an
>>>> iterative and evolutionary process.
>>> there are evolutions and evolutions
>>> I am not trying to convince anyone to stop evolution, no. I wanted to
>>> emphasize that we can look back and see that another approach would
>>> (probably) have been better and would have been almost as hard
>> In this 'probably better' the key is 'probably', and 'better' is far
>> from proven. :-)
>> Luckily, we still have a chance to see this 'probably' with all the
>> highest standards, and even compare it with the existing things.
> 
> Seems like endless. Alex, you want to avoid technical discussions?

I'm really sorry if something in my posts made you think so.
In fact, I was constantly asking for *the technical* facts or
estimates rather than emotional reactions, but it's all my fault if you
took my words the opposite way. :-(

Anyway, this thread has become really hard to read. If you still think
there is anything to discuss, then could you please start another topic?

-- 
Yours faithfully,
      Alex

 > Nice, come and mail me in private. I love your speech.

