This sounds like https://issues.apache.org/jira/browse/HIVE2586 , where comparing float/doubles
will not work because of the way floating point numbers are represented.
Perhaps there is a comparison between a float and double type because of some internal representation
in the Java library, or the UDF.
Ed Capriolo's book has a good section about workarounds and caveats for working with floats/doubles
in hive.
Thanks,
Lauren
From: Periya.Data [mailto:periya.data@gmail.com]
Sent: Friday, December 07, 2012 1:28 PM
To: user@hive.apache.org; cdhuser@cloudera.org
Subject: Hive doubleprecision question
Hi Hive Users,
I recently noticed an interesting behavior with Hive and I am unable to find the reason
for it. Your insights into this is much appreciated.
I am trying to compute the distance between two zip codes. I have the distances computed in
various 'platforms'  SAS, R, Linux+Java, Hive UDF and using Hive's builtin functions. There
are some discrepancies from the 3rd decimal place when I see the output got from using Hive
UDF and Hive's builtin functions. Here is an example:
zip1 zip 2 Hadoop Builtin function SAS R
Linux + Java
00501
11720
4.49493083698542000
4.49508858
4.49508858054005
4.49508857976933000
The formula used to compute distance is this (UDF):
double long1 = Math.atan(1)/45 * ux;
double lat1 = Math.atan(1)/45 * uy;
double long2 = Math.atan(1)/45 * mx;
double lat2 = Math.atan(1)/45 * my;
double X1 = long1;
double Y1 = lat1;
double X2 = long2;
double Y2 = lat2;
double distance = 3949.99 * Math.acos(Math.sin(Y1) *
Math.sin(Y2) + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1  X2));
The one used using builtin functions (same as above):
3949.99*acos( sin(u_y_coord * (atan(1)/45 )) *
sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (atan(1)/45 ))*
cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord *
(atan(1)/45)  m_x_coord * (atan(1)/45)) )
 The Hive's builtin functions used are acos, sin, cos and atan.
 for another try, I used Hive UDF, with Java's math library (Math.acos, Math.atan etc)
 All variables used are double.
I expected the value from Hadoop UDF (and Builtin functions) to be identical with that got
from plain Java code in Linux. But they are not. The builtin function (as well as UDF) gives
49493083698542000 whereas simple Java program running in Linux gives 49508857976933000. The
linux machine is similar to the Hadoop cluster machines.
Linux version  Red Hat 5.5
Java  latest.
Hive  0.7.1
Hadoop  0.20.2
This discrepancy is very consistent across thousands of zipcode distances. It is not a oneoff
occurrence. In some cases, I see the difference from the 4th decimal place. Some more examples:
zip1 zip 2 Hadoop Builtin function SAS R
Linux + Java
00602
00617
42.79095253903410000
42.79072812
42.79072812185650
42.79072812185640000
00603
00617
40.24044016655180000
40.2402289
40.24022889740920
40.24022889740910000
00605
00617
40.19191761288380000
40.19186416
40.19186415807060
40.19186415807060000
I have not tested the individual sin, cos, atan function returns. That will be my next test.
But, at the very least, why is there a difference in the values between Hadoop's UDF/builtins
and that from Linux + Java? I am assuming that Hive's builtin mathematical functions are
nothing but the underlying Java functions.
Thanks,
PD.
