Hi, guys!
I have run the NBody test on Harmony and tried some hacks on the sqrt
implementation. Below are the timings, together with the clocktick
distribution for each run on Windows XP SP2.
The baseline (0) and the successive Harmony runs:
0. Sun 1.6.0_02 (-server): 1200 msecs
     100% Other32
1. Clean Harmony (-Xem:server): 23500 msecs
     80% hyluni.dll: __ieee754_sqrt()
     11% Other32
      6% harmonyvm.dll
2. After replacing the sqrt() call with an intrinsic [1]: 5300 msecs
     40% Other32
     29% hyluni.dll: internal_sqrt() and Java_java_lang_Math_sqrt()
     20% harmonyvm.dll: serving native calls
3. After inlining internal_sqrt() [2]: 5000 msecs
     45% Other32
     23% hyluni.dll: internal_sqrt() and Java_java_lang_Math_sqrt()
     20% harmonyvm.dll: serving native calls
4. After applying JNI transition improvements [3]: 4700 msecs
     50% Other32
     25% hyluni.dll: internal_sqrt() and Java_java_lang_Math_sqrt()
     10% harmonyvm.dll: serving native calls
It seems to me that this could be improved further if some magic
implementing sqrt() were used instead of the native call. Moreover,
AFAIU approach (3) is safe, since IEEE754 compatibility must be
preserved only for strict mode, whereas approach (3) implements a
fast path for non-strict mode.
Thanks,
Aleksey.
[1]
Index: modules/luni/src/main/native/luni/shared/math.c
===================================================================
--- modules/luni/src/main/native/luni/shared/math.c (revision 589547)
+++ modules/luni/src/main/native/luni/shared/math.c (working copy)
@@ -21,6 +21,7 @@
#undef __P
#endif /* defined(__P) */
#include "fdlibm.h"
+#include "xmmintrin.h"
jdouble internal_ceil (jdouble arg1);
jdouble internal_log (jdouble arg1);
@@ -298,7 +299,9 @@
{
jdouble result;
- result = sqrt (arg1);
+ float arg = (float)arg1;
+ _mm_store_ss( & arg, _mm_sqrt_ss( _mm_load_ss( & arg ) ) );
+ result = arg;
return result;
}
[2]
Index: modules/luni/src/main/native/luni/shared/math.c
===================================================================
--- modules/luni/src/main/native/luni/shared/math.c (revision 589547)
+++ modules/luni/src/main/native/luni/shared/math.c (working copy)
@@ -21,6 +21,7 @@
#undef __P
#endif /* defined(__P) */
#include "fdlibm.h"
+#include "xmmintrin.h"
jdouble internal_ceil (jdouble arg1);
jdouble internal_log (jdouble arg1);
@@ -593,7 +594,13 @@
JNIEXPORT jdouble JNICALL
Java_java_lang_Math_sqrt (JNIEnv * env, jclass jclazz, jdouble arg1)
{
- return internal_sqrt (arg1);
+ jdouble result;
+
+ float arg = (float)arg1;
+ _mm_store_ss( & arg, _mm_sqrt_ss( _mm_load_ss( & arg ) ) );
+ result = arg;
+
+ return result;
}
JNIEXPORT jdouble JNICALL
[3] https://issues.apache.org/jira/browse/HARMONY-4704
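One caveat on [1] and [2]: both cast the jdouble argument to float and
use the single-precision _mm_sqrt_ss, so the result keeps only about 24
significant bits. Below is a minimal sketch of a double-precision
variant, assuming an x86 target with SSE2 (emmintrin.h). The helper
names sqrt_via_float and sqrt_via_sse2 are hypothetical, and this
variant has not been measured on Harmony.

```c
#include <xmmintrin.h>  /* SSE:  _mm_load_ss, _mm_sqrt_ss, _mm_store_ss */
#include <emmintrin.h>  /* SSE2: _mm_set_sd, _mm_sqrt_sd, _mm_cvtsd_f64 */

/* Same computation as in [1] and [2]: the argument is rounded to float
 * first, so the result is only single-precision accurate. */
static double sqrt_via_float(double x)
{
    float arg = (float) x;
    _mm_store_ss(&arg, _mm_sqrt_ss(_mm_load_ss(&arg)));
    return arg;
}

/* Hypothetical double-precision variant: SQRTSD on the low lane of an
 * XMM register, with no narrowing to float. */
static double sqrt_via_sse2(double x)
{
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(_mm_sqrt_sd(v, v));
}
```

For an argument such as 2.0 the two helpers differ from roughly the
eighth decimal digit on, which is enough to change NBody results.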
On 11/6/07, Tim Ellison wrote:
> Egor Pasko wrote:
> > On the 0x385 day of Apache Harmony Tim Ellison wrote:
> >> Harmony 5.0 M3 prints
> >> Result = 6.666661664588418E8 in 1352ms
> >> IBM 5.0 SR5a prints
> >> Result = 6.666661664588418E8 in 40ms
> >> Sun 1.6.0-b105 prints
> >> Result = 6.666661664588418E8 in 40ms
> >
> > Wow, good catching microbenchmark! Tim++
>
> Credit to Xiao-Feng for that, not me.
>
> >> Can we squeeze a bit more out of the JIT?
> >
> > actually, Math.sqrt() is native and taken from the slow and portable
> > FDLIBM.
>
> I guess that was my point -- I'm making the assumption, given the orders
> of magnitude difference in time, that we are going through JNI and
> IBM/Sun are recognizing the method in the JIT compiler and creating code
> for it directly.
>
> > I believe, libc implementation is supposed to be
> > faster. -ffast-math does not suit us because sqrt() is a part of
> > IEEE754 standard which is a part of Java (with signalling exception).
>
> I defer to you for the compatibility between the hardware instructions
> and Java standard, but looks like others are doing the 'right thing', or
> am I making a wrong assumption here?
>
> > In fact, NBody multiplies by 1/sqrt(x) function, and this is where
> > both JIT and native implementation could help. multiplication and
> > 1/sqrt(x) must be cheaper than sqrt(x) and division, but I am not sure
> > it would make an IEEE754 compatible solution.
>
> ack.
>
> Regards,
> Tim
>
>
>
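On the 1/sqrt(x) idea quoted above, here is a minimal sketch of what a
hardware reciprocal square root looks like, assuming an x86 target with
SSE. It is single precision only, the helper names rsqrt_estimate and
rsqrt_refined are hypothetical, and it illustrates the compatibility
concern rather than proposing a patch: the estimate is not correctly
rounded.

```c
#include <xmmintrin.h>  /* SSE: _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32 */

/* Hardware estimate of 1/sqrt(x). Intel documents a relative error of
 * at most 1.5 * 2^-12, so this is an approximation, not an IEEE754
 * correctly rounded result. */
static float rsqrt_estimate(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

/* One Newton-Raphson step, y' = y * (1.5 - 0.5 * x * y * y), roughly
 * doubles the number of correct bits. It still does not guarantee
 * correct rounding. */
static float rsqrt_refined(float x)
{
    float y = rsqrt_estimate(x);
    return y * (1.5f - 0.5f * x * y * y);
}
```

Multiplying by such an estimate is cheap, but the result can differ
from sqrt followed by a division in the low bits, which is exactly the
compatibility question raised in the thread.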