harmony-dev mailing list archives

From "Aleksey Shipilev" <aleksey.shipi...@gmail.com>
Subject Re: [classlib] HashMap optimization (again)
Date Sat, 12 Jan 2008 23:35:09 GMT
Hi again, Tim.

So I spent another day on this issue. I've gathered the profile of
SPECjbb2005 and grepped out the HashMap methods (okay, I had to disable
inlining, so the exact numbers differ from an actual performance run):

Current implementation:

6.99% HashMap.findNonNullKeyEntry(Ljava/lang/Object;II)Ljava/util/HashMap$Entry;
0.61% HashMap.getEntry(Ljava/lang/Object;)Ljava/util/HashMap$Entry;
0.25% HashMap.get(Ljava/lang/Object;)Ljava/lang/Object;	
7.86% Total

With the H5374 patch:

6.01% HashMap.findNonNullKeyEntryInteger(Ljava/lang/Object;II)Ljava/util/HashMap$Entry;
0.67% HashMap.findNonNullKeyEntryLegacy(Ljava/lang/Object;II)Ljava/util/HashMap$Entry;
0.61% HashMap.getEntry(Ljava/lang/Object;)Ljava/util/HashMap$Entry;
0.42% HashMap.get(Ljava/lang/Object;)Ljava/lang/Object;	
0.39% HashMap.findNonNullKeyEntry(Ljava/lang/Object;II)Ljava/util/HashMap$Entry;
7.05% Total

Percentages are clocktick percentages of the entire workload.
So the profile shows that the H5374 code is actually faster.
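For reference, here is a minimal, self-contained sketch of the specialization the profile names (the method names follow the profile; the actual H5374 patch may differ in detail, and the Entry/elementData shapes here are simplified). The generic lookup is split into a small dispatcher, an Integer fast path that unboxes once and compares plain ints, and a legacy path for all other key types:

```java
// Simplified sketch of the "manual unboxing" specialization; not the
// literal H5374 patch. Splitting the lookup keeps each chunk small
// enough for the JIT to inline, and the Integer path avoids a virtual
// equals() call per bucket entry.
class SpecializedBucket {
    static final class Entry {
        final Object key;
        final int origKeyHash;
        Object value;
        Entry next;
        Entry(Object key, int hash, Object value) {
            this.key = key; this.origKeyHash = hash; this.value = value;
        }
    }

    Entry[] elementData = new Entry[16];

    Entry findNonNullKeyEntry(Object key, int index, int keyHash) {
        // Dispatcher: a cheap type test picks the specialized loop.
        if (key instanceof Integer) {
            return findNonNullKeyEntryInteger(key, index, keyHash);
        }
        return findNonNullKeyEntryLegacy(key, index, keyHash);
    }

    private Entry findNonNullKeyEntryInteger(Object key, int index, int keyHash) {
        int intKey = ((Integer) key).intValue();  // unbox once, outside the loop
        Entry m = elementData[index];
        while (m != null) {
            // plain int comparison replaces the virtual equals() call
            if (m.origKeyHash == keyHash
                    && m.key instanceof Integer
                    && ((Integer) m.key).intValue() == intKey) {
                return m;
            }
            m = m.next;
        }
        return null;
    }

    private Entry findNonNullKeyEntryLegacy(Object key, int index, int keyHash) {
        Entry m = elementData[index];
        while (m != null && (m.origKeyHash != keyHash || !key.equals(m.key))) {
            m = m.next;
        }
        return m;
    }
}
```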

Then, after a talk with Sergey Kuksenko (credit to him :)), I tried to
compare the two implementations without allocPrefetch, which prefetches
memory for newly created objects and thus incurs high cache pressure.
allocPrefetch itself gives huge boosts, but it can expose cache
limitations for other optimizations. So, with allocPrefetch disabled:

Windows x86
100.0% Harmony-clean
101.1% Harmony + H5374

Windows x86_64
100.0% Harmony-clean
100.5% Harmony + H5374

That's the boost I'm looking for! I wonder why a positive change like
manual unboxing alters the L2 cache access patterns in such a way that
it gives a boost in normal mode and a degradation in the presence of a
heavy L2 cache user.

I have also remeasured all modes carefully, so let's draw a conclusion
on this issue:

Windows x86:
 100.0% [base]   Harmony-clean
 100.2% [+0.2%]  Harmony-clean + H5374
  88.6% [base]   Harmony-clean - allocPrefetch
  89.6% [+1.0%]  Harmony-clean - allocPrefetch + H5374

Windows x86_64:
 100.0% [base]   Harmony-clean
 100.1% [+0.1%]  Harmony-clean + H5374
  88.9% [base]   Harmony-clean - allocPrefetch
  89.3% [+0.5%]  Harmony-clean - allocPrefetch + H5374

...measurement uncertainty is about 0.4%.

Based on this data, I would say this patch can't get much of a boost on
DRLVM, since DRLVM's optimizations do the job of scalarization just
fine. The patch should also improve cache locality, and that seems to
be the case in the absence of another heavy L2 cache user. Add that
such specialization bloats the code a little, and the conclusion from
the DRLVM side is that it would be better to keep the patch out of
trunk.

There is one more possible opportunity: tuning the prefetch distance in
allocPrefetch, but that is a fragile thing to optimize.
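As a rough illustration of the scalarization mentioned above (Jitrino's Scalar Replacement), here is a hand-written before/after sketch. It is not compiler output, and the method names are mine; it only shows the kind of rewrite the optimizer performs when it proves a box never escapes:

```java
// Conceptual before/after of scalar replacement; hand-written for
// illustration, not actual Jitrino output.
class ScalarReplacementDemo {
    // Before: the JIT sees a short-lived Integer box around the probe key.
    static boolean containsBoxed(int[] keys, int probe) {
        Integer boxed = Integer.valueOf(probe); // may allocate (small values are cached)
        for (int k : keys) {
            if (boxed.equals(k)) {              // virtual call + unbox per element
                return true;
            }
        }
        return false;
    }

    // After: once the box is proven non-escaping, its field accesses are
    // rewritten to plain int operations - the same effect the "manual
    // unboxing" patch achieves by hand.
    static boolean containsUnboxed(int[] keys, int probe) {
        for (int k : keys) {
            if (k == probe) {                   // plain int compare
                return true;
            }
        }
        return false;
    }
}
```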


On Jan 11, 2008 12:14 AM, Aleksey Shipilev <aleksey.shipilev@gmail.com> wrote:
> Hi, Tim!
> On Jan 10, 2008 11:53 PM, Tim Ellison <t.p.ellison@gmail.com> wrote:
> > Aleksey Shipilev wrote:
> > > a little update here. I have managed to split the problematic method into
> > > two chunks; I will attach a new patch to JIRA in a few minutes. So far the
> > > picture is the following:
> > So this is an attempt to make the method small enough (measured in
> > what?) to be inlined by the JIT, right?  Is there a way to annotate the
> > JIT to always in-line it, say by name, rather than juggling the size?
> AFAIU, to keep things modular, that would require implementing an
> annotation in classlib that any JIT could recognize. AFAIK, Jitrino
> has an @Inline pragma, but it's defined in DRLVM classes. Anyway, IMO
> the thing I did should be done automatically by the JIT, because it's
> a specialization of the method that should eliminate unneeded branches
> and thus make the code eligible for inlining.
> That's also why your patch is really interesting: it can help us
> understand whether such specialization is worth it. If it is, then we
> might consider implementing some JIT-side optimization, or even apply
> similar "manual unboxing" to other primitive types.
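[A classlib-visible inlining hint of the kind discussed above might look like the following. This is purely a hypothetical sketch: the annotation name InlineHint and its retention are assumptions; Harmony never defined such an annotation, and Jitrino's real @Inline pragma lives in DRLVM-internal classes.]

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical sketch only: a VM-neutral inlining hint that any JIT
// could recognize by name rather than by method size.
@Retention(RetentionPolicy.CLASS) // kept in the class file for the JIT to read
@Target(ElementType.METHOD)
@interface InlineHint {
}

class Example {
    // A small hot method the JIT would be asked to always inline,
    // regardless of its bytecode size.
    @InlineHint
    static int spreadHash(int h) {
        return h ^ (h >>> 16);
    }
}
```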
> > > Windows x86_64:
> > >  100% Harmony-clean
> > >  97.7% Harmony-clean + H5374 (old)
> > >  99.6% Harmony-clean + H5374 (new)
> > >
> > > Windows x86:
> > >  100% Harmony-clean
> > >  97.9% Harmony-clean + H5374 (old)
> > >  99.6% Harmony-clean + H5374 (new)
> >
> > What benchmark are you using?  I saw this specialization technique give
> > a reasonable boost on SPECjbb2005, with IBM VME not DRLVM, and was
> > hoping it would be applicable here too.
> Yep, it is SPECjbb2005 - I thought that was implied, sorry. I don't
> know IBM VME very well, but I believe the boost on J9 and the lack of
> a boost on DRLVM are caused by the Scalar Replacement technique
> implemented in Jitrino, which unboxes such primitive types (here,
> Integer -> int) during compilation, so it negates the effect of your
> "manual unboxing".
> > Avoiding the value dereference enhances data locality and thereby causes fewer CPU cache misses.
> > I think we should only apply the patch if it produces a benefit to Harmony.
> Yep, that's true. So I was surprised that your patch gives a
> degradation when it should give a boost - that was the firestarter of
> my investigation - and I don't have a complete explanation even now,
> though we have reduced the degradation significantly. I'm looking for
> some other, not-so-obvious problem there...
> Thanks,
> Aleksey,
> ESSD, Intel.
