harmony-dev mailing list archives

From Egor Pasko <egor.pa...@gmail.com>
Subject Re: [performance] quick sort is 4x slower on Harmony
Date Wed, 09 Jan 2008 21:25:48 GMT
On the 0x3C6 day of Apache Harmony Aleksey Shipilev wrote:
> It looks like "lack of on-stack-replacement" is not the only issue. I
> profiled the first iteration of server-mode Harmony, starting 10 seconds
> after the iteration began, and saw this (the percentages are clockticks
> of the entire process):
>  48% libpthread-2.3.4
>  22% libem
>  14% Other32
>  7% libc
>  7% libharmonyvm
>  3% libhythr
>  3% libjitrino
> 
> Note that only 14% is workload code, which gives a ~7x degradation.
> 
> As for local method distribution:
>  25% libpthread#__pthread_mutex_lock
>  23% libpthread#__pthread_mutex_unlock
>  13% libem#ValueMethodProfile::addNewValue
>  4% libem#_Rb_tree::find
>  4% libharmonyvm#helper_get_interface_vtable
>  3% libem#value_profiler_add_value
>  3% libharmonyvm#rth_get_interface_vtable
>  1% libem#EdgeProfileCollector::isMethodHot
> 
> 76% of total time is spent on gathering and maintaining the profile,
> and most of that time is locking. AFAIK, those are the mutexes guarding
> profile information. That correlates perfectly with my previous
> experiment checking the "on-stack-replacement" hypothesis.
> 
> We might want to employ Thread Profiler to check what can be done. Can
> we somehow avoid locking there? Maybe some "thin locking" on native
> side like thin monitors in Java?

Aleksey, thank you for this great and very useful investigation!

HA! Hello sluggish Value-Profiling :))

I wonder why the profile is still being collected after 10 seconds. It
looks like the code that is running is not the recompiled code. To solve
this specific overprofiling we can introduce value profile readiness and
do 'nop'-ing of profiling helper calls, just like Jitrino.JET does for
the edge profile. But first we need to check whether this really happens
in this test at steady state.
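
To illustrate the readiness idea, here is a minimal Java sketch. It is
not Harmony code: the class name, the threshold and the record() stand-in
are all assumptions made up for the example. A real JIT would patch the
helper call site with nops; here a volatile flag check plays that role.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Harmony code: names and threshold are assumptions.
class ValueProfileReadiness {
    // Assumed sample budget after which the profile counts as "ready".
    static final int READY_THRESHOLD = 1000;

    private final AtomicInteger samples = new AtomicInteger();
    private volatile boolean ready;

    // Called from profiled code. Returns true if the sample was recorded,
    // false if profiling is already switched off ("nop"-ed).
    boolean profile(int value) {
        if (ready) {
            return false; // cheap early exit once the profile is ready
        }
        record(value);
        if (samples.incrementAndGet() >= READY_THRESHOLD) {
            ready = true; // from now on every call takes the early exit
        }
        return true;
    }

    // Stand-in for the real work of updating the value profile.
    private void record(int value) {
    }

    boolean isReady() {
        return ready;
    }

    int sampleCount() {
        return samples.get();
    }
}
```

With several threads a few extra samples may slip in before every thread
observes the flag, which should be harmless for a profile.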

On locking. We should not pay for high profile accuracy by sacrificing
speed of collection (just as we do not lock on edge profiling), but I
currently have no idea how to modify the top-n-value tables in value
profiling correctly without synchronization (in fact, one atomic cmpxchg
should be enough).
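
One possible shape for such a table, as a rough Java sketch only. This is
not the libem algorithm: the slot layout (value and count packed into one
64-bit word), the "decay the weakest slot on a miss" policy and all names
are assumptions for illustration. Each sample costs at most one
compareAndSet (typically a cmpxchg on x86), and a lost race just drops
the sample, trading accuracy for speed.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch of a lossy, lock-free top-N value table.
class LockFreeTopN {
    // Each slot packs the value (high 32 bits) and its count (low 32 bits).
    // A count of 0 means the slot is empty.
    private final AtomicLongArray slots;

    LockFreeTopN(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        slots = new AtomicLongArray(n);
    }

    private static long pack(int value, int count) {
        return ((long) value << 32) | (count & 0xFFFFFFFFL);
    }

    private static int valueOf(long slot) {
        return (int) (slot >>> 32);
    }

    private static int countOf(long slot) {
        return (int) slot;
    }

    // Records one sample with a single CAS. Returns false if the CAS lost
    // a race with another thread, in which case the sample is dropped.
    boolean add(int value) {
        int target = -1;
        long seen = 0;
        long next = 0;
        int minCount = Integer.MAX_VALUE;
        for (int i = 0; i < slots.length(); i++) {
            long s = slots.get(i);
            int c = countOf(s);
            if (c > 0 && valueOf(s) == value) {
                // Hit: bump the count, saturating instead of overflowing.
                target = i;
                seen = s;
                next = pack(value, c == Integer.MAX_VALUE ? c : c + 1);
                break;
            }
            if (c < minCount) {
                // Miss so far: remember the weakest slot. An empty slot is
                // claimed for the new value, a used one is decayed by one.
                minCount = c;
                target = i;
                seen = s;
                next = (c == 0) ? pack(value, 1) : pack(valueOf(s), c - 1);
            }
        }
        return slots.compareAndSet(target, seen, next);
    }

    // Returns the current count for a value, or 0 if it is not in the table.
    int countFor(int value) {
        for (int i = 0; i < slots.length(); i++) {
            long s = slots.get(i);
            if (countOf(s) > 0 && valueOf(s) == value) {
                return countOf(s);
            }
        }
        return 0;
    }
}
```

Two racing threads may briefly insert the same value into two slots;
the sketch accepts that as part of the lost accuracy.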

Anyway, spinlocks and futexes will help, but I would prefer a more
intelligent solution here (if possible:)

> On Jan 9, 2008 7:10 PM, Aleksey Shipilev <aleksey.shipilev@gmail.com> wrote:
> > Hi, guys!
> >
> > I see no point in measuring computational performance for one
> > iteration - there is a compilation stage that contributes significantly
> > in the first stages, at least on Harmony. I've modified the test by
> > wrapping the main() body in a loop (basically, what Egor did), and then
> > measured its performance.
> >
> > So, on Linux/RHEL4/ia32 at 16-way Tulsa 3.2 GHz / 16 GB DDR:
> >
> > === /localdisk/jdk1.6.0_02/bin/java -client GenericQuicksort ===
> > iteration 0: elapsed: 4798ms
> > iteration 1: elapsed: 4780ms
> > iteration 2: elapsed: 4749ms
> > iteration 3: elapsed: 4860ms
> > iteration 4: elapsed: 4862ms
> >
> > === /localdisk/jdk1.6.0_02/bin/java -server GenericQuicksort ===
> > iteration 0: elapsed: 4903ms
> > iteration 1: elapsed: 4830ms
> > iteration 2: elapsed: 5161ms
> > iteration 3: elapsed: 5122ms
> > iteration 4: elapsed: 5128ms
> >
> > === /nfs/pb/home/ashipile/jre-r610377-work/bin/java -Xem:client GenericQuicksort ===
> > iteration 0: elapsed: 47270ms
> > iteration 1: elapsed: 29312ms
> > iteration 2: elapsed: 29324ms
> > iteration 3: elapsed: 29278ms
> > iteration 4: elapsed: 29401ms
> >
> > === /nfs/pb/home/ashipile/jre-r610377-work/bin/java -Xem:server GenericQuicksort ===
> > iteration 0: elapsed: 184097ms
> > iteration 1: elapsed: 5684ms
> > iteration 2: elapsed: 5664ms
> > iteration 3: elapsed: 5661ms
> > iteration 4: elapsed: 5680ms
> >
> > So, if we compare hot recompiled code only (which is the case for the
> > later iterations), Harmony is only 1.1x slower.
> >
> > That's an interesting thing - the first iteration takes 37x more time
> > than RI, and -Xverbose:em shows that recompilation stops after 3
> > seconds, so optimized code exists, but it isn't being executed. That's
> > probably because qsort() is never re-entered, so the JIT can't replace
> > the code for it. This is where on-stack replacement should come in, as
> > Egor mentioned. But if I try to make it re-enterable:
> >
> >    private static void qsort(QuickSortable sortable, LinkedList<Range> stack) {
> >        while (!stack.isEmpty()) {
> >            qsortImpl(sortable, stack);
> >        }
> >    }
> >
> >    private static void qsortImpl(QuickSortable sortable, LinkedList<Range> stack) {
> >        ...
> >    }
> >
> > ...still, I have the same problem, but in my view qsortImpl should
> > not suffer from the absence of on-stack replacement. I have disabled
> > inlining to make sure it is not inlined.
> >
> > I will follow up with updates later.
> >
> > Thanks,
> > Aleksey,
> > ESSD, Intel.
> >
> 

-- 
Egor Pasko

