# hive-user mailing list archives

##### Site index · List index
Message view
Top
From Lauren Yang <Lauren.Y...@microsoft.com>
Subject RE: Hive double-precision question
Date Fri, 07 Dec 2012 22:12:51 GMT
```This sounds like https://issues.apache.org/jira/browse/HIVE-2586 , where comparing float/doubles
will not work because of the way floating point numbers are represented.

Perhaps there is a comparison between a  float and double type because of some internal representation
in the Java library, or the UDF.

Ed Capriolo's book has a good section about workarounds and caveats for working with floats/doubles
in hive.

Thanks,
Lauren
From: Periya.Data [mailto:periya.data@gmail.com]
Sent: Friday, December 07, 2012 1:28 PM
To: user@hive.apache.org; cdh-user@cloudera.org
Subject: Hive double-precision question

Hi Hive Users,
I recently noticed an interesting behavior with Hive and I am unable to find the reason
for it. Your insights into this is much appreciated.

I am trying to compute the distance between two zip codes. I have the distances computed in
various 'platforms' - SAS, R, Linux+Java, Hive UDF and using Hive's built-in functions. There
are some discrepancies from the 3rd decimal place when I see the output got from using Hive
UDF and Hive's built-in functions. Here is an example:

zip1          zip 2          Hadoop Built-in function    SAS                      R
Linux + Java
00501

11720

4.49493083698542000

4.49508858

4.49508858054005

4.49508857976933000

The formula used to compute distance is this (UDF):

double long1 = Math.atan(1)/45 * ux;
double lat1 = Math.atan(1)/45 * uy;
double long2 = Math.atan(1)/45 * mx;
double lat2 = Math.atan(1)/45 * my;

double X1 = long1;
double Y1 = lat1;
double X2 = long2;
double Y2 = lat2;

double distance = 3949.99 * Math.acos(Math.sin(Y1) *
Math.sin(Y2) + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1 - X2));

The one used using built-in functions (same as above):
3949.99*acos(  sin(u_y_coord * (atan(1)/45 )) *
sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (atan(1)/45 ))*
cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord *
(atan(1)/45) - m_x_coord * (atan(1)/45)) )

- The Hive's built-in functions used are acos, sin, cos and atan.
- for another try, I used Hive UDF, with Java's math library (Math.acos, Math.atan etc)
- All variables used are double.

I expected the value from Hadoop UDF (and Built-in functions) to be identical with that got
from plain Java code in Linux. But they are not. The built-in function (as well as UDF) gives
49493083698542000 whereas simple Java program running in Linux gives 49508857976933000. The
linux machine is similar to the Hadoop cluster machines.

Linux version - Red Hat 5.5
Java - latest.
Hive - 0.7.1

This discrepancy is very consistent across thousands of zip-code distances. It is not a one-off
occurrence. In some cases, I see the difference from the 4th decimal place. Some more examples:

zip1          zip 2          Hadoop Built-in function    SAS                      R
Linux + Java
00602

00617

42.79095253903410000

42.79072812

42.79072812185650

42.79072812185640000

00603

00617

40.24044016655180000

40.2402289

40.24022889740920

40.24022889740910000

00605

00617

40.19191761288380000

40.19186416

40.19186415807060

40.19186415807060000

I have not tested the individual sin, cos, atan function returns. That will be my next test.
But, at the very least, why is there a difference in the values between Hadoop's UDF/built-ins
and that from Linux + Java?  I am assuming that Hive's built-in mathematical functions are
nothing but the underlying Java functions.

Thanks,
PD.

```
Mime
View raw message