hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Periya.Data" <periya.d...@gmail.com>
Subject Re: Hive double-precision question
Date Sat, 08 Dec 2012 19:23:14 GMT
Hi Lauren and Zhang,
    The book "Programming Hive" suggests to use Double (instead of float)
and also to cast any literal value to double. I am already using double for
all my computations (both in hive table schema as well as in my UDF).
Furthermore, I am not comparing two floats/doubles. I am doing some
computations involving doubles...and those minor differences are adding up.

It looks like what Mark Grover was telling - mapping between Java datatypes
to Hive data-types. I am yet to look at that portion of the source-code.

Thanks and will keep you posted,
/PD



On Fri, Dec 7, 2012 at 2:12 PM, Lauren Yang <Lauren.Yang@microsoft.com>wrote:

>  This sounds like https://issues.apache.org/jira/browse/HIVE-2586 , where
> comparing float/doubles will not work because of the way floating point
> numbers are represented.****
>
> ** **
>
> Perhaps there is a comparison between a  float and double type because of
> some internal representation in the Java library, or the UDF.****
>
> ** **
>
> Ed Capriolo’s book has a good section about workarounds and caveats for
> working with floats/doubles in hive.****
>
> ** **
>
> Thanks,****
>
> Lauren****
>
> *From:* Periya.Data [mailto:periya.data@gmail.com]
> *Sent:* Friday, December 07, 2012 1:28 PM
> *To:* user@hive.apache.org; cdh-user@cloudera.org
> *Subject:* Hive double-precision question****
>
> ** **
>
> Hi Hive Users,
>     I recently noticed an interesting behavior with Hive and I am unable
> to find the reason for it. Your insights into this is much appreciated.
>
> I am trying to compute the distance between two zip codes. I have the
> distances computed in various 'platforms' - SAS, R, Linux+Java, Hive UDF
> and using Hive's built-in functions. There are some discrepancies from the
> 3rd decimal place when I see the output got from using Hive UDF and Hive's
> built-in functions. Here is an example:
>
> zip1          zip 2          Hadoop Built-in function
> SAS                      R                                       Linux +
> Java****
>
> 00501  ****
>
> 11720  ****
>
> 4.49493083698542000****
>
> 4.49508858****
>
> 4.49508858054005****
>
> 4.49508857976933000****
>
>
> The formula used to compute distance is this (UDF):
>
>         double long1 = Math.atan(1)/45 * ux;
>         double lat1 = Math.atan(1)/45 * uy;
>         double long2 = Math.atan(1)/45 * mx;
>         double lat2 = Math.atan(1)/45 * my;
>
>         double X1 = long1;
>         double Y1 = lat1;
>         double X2 = long2;
>         double Y2 = lat2;
>
>         double distance = 3949.99 * Math.acos(Math.sin(Y1) *
>                 Math.sin(Y2) + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1 -
> X2));
>
>
> The one used using built-in functions (same as above):
> 3949.99*acos(  sin(u_y_coord * (atan(1)/45 )) *
>         sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (atan(1)/45 ))*
>         cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord *
>         (atan(1)/45) - m_x_coord * (atan(1)/45)) )
>
>
>
>
> - The Hive's built-in functions used are acos, sin, cos and atan.
> - for another try, I used Hive UDF, with Java's math library (Math.acos,
> Math.atan etc)
> - All variables used are double.
>
> I expected the value from Hadoop UDF (and Built-in functions) to be
> identical with that got from plain Java code in Linux. But they are not.
> The built-in function (as well as UDF) gives 49493083698542000 whereas
> simple Java program running in Linux gives 49508857976933000. The linux
> machine is similar to the Hadoop cluster machines.
>
> Linux version - Red Hat 5.5
> Java - latest.
> Hive - 0.7.1
> Hadoop - 0.20.2
>
> This discrepancy is very consistent across thousands of zip-code
> distances. It is not a one-off occurrence. In some cases, I see the
> difference from the 4th decimal place. Some more examples:
>
> zip1          zip 2          Hadoop Built-in function
> SAS                      R                                       Linux +
> Java****
>
> 00602  ****
>
> 00617  ****
>
> 42.79095253903410000****
>
> 42.79072812****
>
> 42.79072812185650****
>
> 42.79072812185640000****
>
> 00603  ****
>
> 00617  ****
>
> 40.24044016655180000****
>
> 40.2402289****
>
> 40.24022889740920****
>
> 40.24022889740910000****
>
> 00605  ****
>
> 00617  ****
>
> 40.19191761288380000****
>
> 40.19186416****
>
> 40.19186415807060****
>
> 40.19186415807060000****
>
>
> I have not tested the individual sin, cos, atan function returns. That
> will be my next test. But, at the very least, why is there a difference in
> the values between Hadoop's UDF/built-ins and that from Linux + Java?  I am
> assuming that Hive's built-in mathematical functions are nothing but the
> underlying Java functions.
>
> Thanks,
> PD.****
>

Mime
View raw message