harmony-dev mailing list archives

From "Aleksey Shipilev" <aleksey.shipi...@gmail.com>
Subject Re: [performance] quick sort is 4x slower on Harmony
Date Wed, 09 Jan 2008 18:23:23 GMT
It looks like "lack of on-stack replacement" is not the only issue. I
profiled the first iteration of server-mode Harmony, starting 10 seconds
into the iteration, and saw the following (percentages are clockticks
of the entire process):
 48% libpthread-2.3.4
 22% libem
 14% Other32
 7% libc
 7% libharmonyvm
 3% libhythr
 3% libjitrino

Note that only 14% of the time goes to workload code, which gives a
~7x degradation (1/0.14 ≈ 7).

As for the per-method distribution:
 25% libpthread#__pthread_mutex_lock
 23% libpthread#__pthread_mutex_unlock
 13% libem#ValueMethodProfile::addNewValue
 4% libem#_Rb_tree::find
 4% libharmonyvm#helper_get_interface_vtable
 3% libem#value_profiler_add_value
 3% libharmonyvm#rth_get_interface_vtable
 1% libem#EdgeProfileCollector::isMethodHot

76% of the total time is spent gathering and maintaining the profile,
and most of that is locking. AFAIK, those are the mutexes guarding the
profile information. That correlates perfectly with my previous
experiment checking the "on-stack replacement" hypothesis.

We might want to employ Thread Profiler to check what can be done. Can
we somehow avoid locking there? Maybe some "thin locking" on the native
side, like thin monitors in Java?
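
For illustration only, the "thin locking" idea can be sketched in Java
as a lock whose uncontended acquire is a single compare-and-swap, with
a yield-and-retry loop under contention. This is not Harmony's code:
the real profiler is native code in libem, and ThinLock, ValueProfile
and runDemo below are hypothetical names invented for this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class ThinLockProfileSketch {

    /** Hypothetical thin lock: one CAS when uncontended, yield and retry otherwise. */
    static final class ThinLock {
        private final AtomicInteger state = new AtomicInteger(0); // 0 = free, 1 = held

        void lock() {
            // Fast path: a single compare-and-swap succeeds if nobody holds the lock.
            while (!state.compareAndSet(0, 1)) {
                Thread.yield(); // contended: give up the CPU slice and retry
            }
        }

        void unlock() {
            state.set(0); // volatile write publishes the guarded updates
        }
    }

    /** Hypothetical stand-in for a value profile: counts how often each value is seen. */
    static final class ValueProfile {
        private final ThinLock lock = new ThinLock();
        private final Map<Long, Long> counts = new HashMap<>();

        void addValue(long value) {
            lock.lock();
            try {
                counts.merge(value, 1L, Long::sum);
            } finally {
                lock.unlock();
            }
        }

        long count(long value) {
            lock.lock();
            try {
                return counts.getOrDefault(value, 0L);
            } finally {
                lock.unlock();
            }
        }
    }

    /** Runs several threads that all record the same value; returns the final count. */
    static long runDemo(int threads, int perThread) {
        ValueProfile profile = new ValueProfile();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    profile.addValue(42L);
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return profile.count(42L);
    }

    public static void main(String[] args) {
        System.out.println("recorded " + runDemo(4, 10000) + " samples");
    }
}
```

A pure spin/yield lock like this only pays off when critical sections
are very short and contention is low; a real thin lock would normally
fall back to a blocking ("fat") lock under sustained contention.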

Thanks,
Aleksey,
ESSD, Intel.

On Jan 9, 2008 7:10 PM, Aleksey Shipilev <aleksey.shipilev@gmail.com> wrote:
> Hi, guys!
>
> I see no point in measuring computational performance for one
> iteration - there is a compilation stage that contributes
> significantly in the first stages, at least on Harmony. I've modified
> the test by wrapping the main() body in a loop (basically, what Egor
> did), and then measured its performance.
>
> So, on Linux/RHEL4/ia32 on a 16-way Tulsa 3.2 GHz / 16 GB DDR:
>
> === /localdisk/jdk1.6.0_02/bin/java -client GenericQuicksort ===
> iteration 0: elapsed: 4798ms
> iteration 1: elapsed: 4780ms
> iteration 2: elapsed: 4749ms
> iteration 3: elapsed: 4860ms
> iteration 4: elapsed: 4862ms
>
> === /localdisk/jdk1.6.0_02/bin/java -server GenericQuicksort ===
> iteration 0: elapsed: 4903ms
> iteration 1: elapsed: 4830ms
> iteration 2: elapsed: 5161ms
> iteration 3: elapsed: 5122ms
> iteration 4: elapsed: 5128ms
>
> === /nfs/pb/home/ashipile/jre-r610377-work/bin/java -Xem:client GenericQuicksort ===
> iteration 0: elapsed: 47270ms
> iteration 1: elapsed: 29312ms
> iteration 2: elapsed: 29324ms
> iteration 3: elapsed: 29278ms
> iteration 4: elapsed: 29401ms
>
> === /nfs/pb/home/ashipile/jre-r610377-work/bin/java -Xem:server GenericQuicksort ===
> iteration 0: elapsed: 184097ms
> iteration 1: elapsed: 5684ms
> iteration 2: elapsed: 5664ms
> iteration 3: elapsed: 5661ms
> iteration 4: elapsed: 5680ms
>
> So, if we compare hot recompiled code only (that's the case for the
> later iterations), Harmony is only 1.1x slower.
>
> That's an interesting thing - the first iteration takes 37x more time
> than RI, and -Xverbose:em shows that recompilation stops after 3
> seconds, so optimized code exists, but it isn't being used. That's
> probably because qsort() is not re-enterable, so the JIT can't replace
> the code for it. That is where on-stack replacement should come in, as
> Egor mentioned. But if I try to make it re-enterable:
>
>    private static void qsort(QuickSortable sortable, LinkedList<Range> stack) {
>        while (!stack.isEmpty()) {
>            qsortImpl(sortable, stack);
>        }
>    }
>
>    private static void qsortImpl(QuickSortable sortable, LinkedList<Range> stack) {
>        ...
>    }
>
> ...still, I have the same problem, but in my view qsortImpl should
> not suffer from the absence of on-stack replacement. I have disabled
> inlining to make sure it is not inlined.
>
> I will follow up with updates later.
>
> Thanks,
> Aleksey,
> ESSD, Intel.
>
