Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Received-SPF: pass (athena.apache.org: domain of periya.data@gmail.com
 designates 209.85.219.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAKm=R7UGPpOsfRuxvdZZxdMk4q0ZbjSbFVJv1wOS3P8YV3dJ8A@mail.gmail.com>
References: 
 <CACO1OL-APuOj_8u53AwDypwRiFYafoG-JkBrZ-EcGuxOn=39sw@mail.gmail.com>
	<CAKm=R7UGPpOsfRuxvdZZxdMk4q0ZbjSbFVJv1wOS3P8YV3dJ8A@mail.gmail.com>
Date: Fri, 7 Dec 2012 15:05:23 -0800
Message-ID: 
 <CACO1OL-mfx1Ku3gEEPoM=QXefcdx26QX+9uQRxPqmzTx1ONHQA@mail.gmail.com>
Subject: Re: Hive double-precision question
From: "Periya.Data" <periya.data@gmail.com>
To: user@hive.apache.org
Cc: cdh-user@cloudera.org
Content-Type: multipart/alternative; boundary=e89a8fb1f3be1e765704d04b40b3

--e89a8fb1f3be1e765704d04b40b3
Content-Type: text/plain; charset=ISO-8859-1

Hi Mark,
   Thanks for the pointers. I looked at the code, and my Java code and the
Hive code look similar (I am a basic-level Java guy). The UDF below uses
Math.sin, which is what I used to test the "Linux + Java" result. I still
have to work out what DoubleWritable and serde2 are all about.

package org.apache.hadoop.hive.ql.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;

/**
* UDFSin.
*
*/
@Description(name = "sin",
    value = "_FUNC_(x) - returns the sine of x (x is in radians)",
    extended = "Example:\n "
    + " > SELECT _FUNC_(0) FROM src LIMIT 1;\n" + " 0")
public class UDFSin extends UDF {
  private DoubleWritable result = new DoubleWritable();

  public UDFSin() {
  }

  public DoubleWritable evaluate(DoubleWritable a) {
    if (a == null) {
      return null;
    } else {
      result.set(Math.sin(a.get()));
      return result;
    }
  }
}
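
To convince myself that the wrapper is not where the digits go missing, here
is a minimal standalone sketch. DoubleBox is a hypothetical stand-in I made up
for DoubleWritable (assuming, as it appears from the source, that the real
class just holds a Java double), so this runs without Hive or Hadoop on the
classpath. SinSketch is likewise my own name, not a Hive class.

```java
// Same shape as UDFSin.evaluate, minus the Hive base class.
class SinSketch {
    private final DoubleBox result = new DoubleBox();

    DoubleBox evaluate(DoubleBox a) {
        if (a == null) {
            return null;              // NULL in, NULL out
        }
        result.set(Math.sin(a.get()));
        return result;                // the same holder object is reused on every call
    }

    public static void main(String[] args) {
        SinSketch udf = new SinSketch();
        double x = 0.5;
        double viaBox = udf.evaluate(new DoubleBox(x)).get();
        // Storing a double in a holder and reading it back does not change its bits.
        System.out.println(Double.compare(viaBox, Math.sin(x)) == 0);
    }
}

// Hypothetical stand-in for DoubleWritable: a mutable holder for one double.
final class DoubleBox {
    private double value;

    DoubleBox() { }

    DoubleBox(double value) { this.value = value; }

    double get() { return value; }

    void set(double value) { this.value = value; }
}
```

If the real DoubleWritable behaves like this holder, the UDF wrapper itself
should not be able to change the result of Math.sin.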


On Fri, Dec 7, 2012 at 2:02 PM, Mark Grover <grover.markgrover@gmail.com> wrote:

> Periya:
> If you want to see what the built-in Hive UDFs are doing, the code is here:
>
> https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic
> and
>
> https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf
>
> You can find out which UDF name maps to what class by looking at
> https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java
>
> If my memory serves me right, Hive does some "interesting" things when
> mapping Java types to Hive datatypes. I am not sure how relevant that is to
> this discussion, and I will have to look further before commenting more.
>
> In the meantime, take a look at the UDF code and see if your personal Java
> code on Linux is equivalent to the Hive UDF code.
>
> Keep us posted!
> Mark
>
> On Fri, Dec 7, 2012 at 1:27 PM, Periya.Data <periya.data@gmail.com> wrote:
>
>> Hi Hive Users,
>>     I recently noticed an interesting behavior in Hive and I am unable to
>> find the reason for it. Any insight would be much appreciated.
>>
>> I am trying to compute the distance between two zip codes. I have computed
>> the distances on various 'platforms': SAS, R, Linux + Java, a Hive UDF, and
>> Hive's built-in functions. The output from the Hive UDF and from Hive's
>> built-in functions differs from the others starting at the 3rd decimal
>> place. Here is an example:
>>
>> zip1   zip2   Hadoop built-in      SAS         R                 Linux + Java
>> 00501  11720  4.49493083698542000  4.49508858  4.49508858054005  4.49508857976933000
>>
>> The formula used to compute the distance is this (UDF):
>>
>>         // Math.atan(1)/45 is pi/180, i.e. it converts degrees to radians
>>         double long1 = Math.atan(1)/45 * ux;
>>         double lat1 = Math.atan(1)/45 * uy;
>>         double long2 = Math.atan(1)/45 * mx;
>>         double lat2 = Math.atan(1)/45 * my;
>>
>>         double X1 = long1;
>>         double Y1 = lat1;
>>         double X2 = long2;
>>         double Y2 = lat2;
>>
>>         double distance = 3949.99 * Math.acos(Math.sin(Y1) * Math.sin(Y2)
>>                 + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1 - X2));
>>
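For anyone who wants to reproduce the "Linux + Java" number, the snippet above
can be wrapped into a standalone program roughly as follows. This is only a
sketch: the class name ZipDistance and the coordinates in main are made up for
illustration and are not taken from the real zip-code table.

```java
class ZipDistance {
    // Same conversion factor as in the formula above: atan(1)/45 == pi/180.
    static final double DEG_TO_RAD = Math.atan(1) / 45;

    // ux, uy: longitude and latitude of the first point, in degrees.
    // mx, my: longitude and latitude of the second point, in degrees.
    static double distance(double ux, double uy, double mx, double my) {
        double x1 = DEG_TO_RAD * ux;
        double y1 = DEG_TO_RAD * uy;
        double x2 = DEG_TO_RAD * mx;
        double y2 = DEG_TO_RAD * my;

        // Spherical law of cosines, scaled by the 3949.99 constant from the original.
        return 3949.99 * Math.acos(Math.sin(y1) * Math.sin(y2)
                + Math.cos(y1) * Math.cos(y2) * Math.cos(x1 - x2));
    }

    public static void main(String[] args) {
        // Hypothetical coordinates, for illustration only.
        double d = distance(-73.0456, 40.8154, -73.2705, 40.7602);
        System.out.println(d);
    }
}
```

Printing the double directly (rather than through a fixed-width format) keeps
every digit Java has, which makes the comparison with Hive's output easier.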
>>
>> The same formula written with Hive's built-in functions:
>> 3949.99*acos(  sin(u_y_coord * (atan(1)/45 )) *
>>         sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (atan(1)/45 ))*
>>         cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord *
>>         (atan(1)/45) - m_x_coord * (atan(1)/45)) )
>>
>>
>>
>>
>> - The Hive built-in functions used are acos, sin, cos and atan.
>> - As another attempt, I used a Hive UDF with Java's math library
>> (Math.acos, Math.atan, etc.).
>> - All variables used are double.
>>
>> I expected the values from the Hive UDF (and the built-in functions) to be
>> identical to those from plain Java code on Linux, but they are not. The
>> built-in function (as well as the UDF) gives 4.49493083698542000, whereas a
>> simple Java program running on Linux gives 4.49508857976933000. The Linux
>> machine is similar to the Hadoop cluster machines.
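One thing I have not ruled out (a guess, not something I have confirmed on the
cluster) is that the coordinates reach the formula with less precision than in
my standalone program, for example if the table columns were declared FLOAT
rather than DOUBLE. The sketch below uses hypothetical coordinates and a
made-up class name, and simply rounds each input to float before running the
same double arithmetic, to see how large an effect that could have.

```java
class FloatNarrowing {
    static final double DEG_TO_RAD = Math.atan(1) / 45;

    // Same formula as in the original message; all arithmetic is done in double.
    static double distance(double ux, double uy, double mx, double my) {
        double x1 = DEG_TO_RAD * ux;
        double y1 = DEG_TO_RAD * uy;
        double x2 = DEG_TO_RAD * mx;
        double y2 = DEG_TO_RAD * my;
        return 3949.99 * Math.acos(Math.sin(y1) * Math.sin(y2)
                + Math.cos(y1) * Math.cos(y2) * Math.cos(x1 - x2));
    }

    // Round a double to the nearest float and widen it back to double.
    static double narrow(double v) {
        return (double) (float) v;
    }

    // Same distance, but with every coordinate first rounded to float precision.
    static double distanceFromFloats(double ux, double uy, double mx, double my) {
        return distance(narrow(ux), narrow(uy), narrow(mx), narrow(my));
    }

    public static void main(String[] args) {
        // Hypothetical coordinates, for illustration only.
        double ux = -73.0456, uy = 40.8154, mx = -73.2705, my = 40.7602;
        double full = distance(ux, uy, mx, my);
        double narrowed = distanceFromFloats(ux, uy, mx, my);
        System.out.println("double inputs: " + full);
        System.out.println("float inputs:  " + narrowed);
        System.out.println("difference:    " + Math.abs(full - narrowed));
    }
}
```

If the difference lands around the 3rd or 4th decimal place, checking the
column types with DESCRIBE on the table would be a sensible next step.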
>>
>> Linux version - Red Hat 5.5
>> Java - latest.
>> Hive - 0.7.1
>> Hadoop - 0.20.2
>>
>> This discrepancy is consistent across thousands of zip-code distances; it
>> is not a one-off occurrence. In some cases the difference starts at the
>> 4th decimal place. Some more examples:
>>
>> zip1   zip2   Hadoop built-in       SAS          R                  Linux + Java
>> 00602  00617  42.79095253903410000  42.79072812  42.79072812185650  42.79072812185640000
>> 00603  00617  40.24044016655180000  40.2402289   40.24022889740920  40.24022889740910000
>> 00605  00617  40.19191761288380000  40.19186416  40.19186415807060  40.19186415807060000
>>
>> I have not yet tested the individual return values of sin, cos and atan;
>> that will be my next test. But, at the very least, why is there a
>> difference between the values from Hive's UDF/built-ins and those from
>> Linux + Java? I am assuming that Hive's built-in mathematical functions are
>> nothing but the underlying Java functions.
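For that next test, something like the sketch below should do. It prints each
function's result at full precision (decimal plus the exact hex bit pattern)
so the numbers can be compared with what Hive returns for the same argument.
The class name and the input value are hypothetical. It also prints the
StrictMath result: Math is only required to be within 1 ulp of the exact
answer, while StrictMath is reproducible bit for bit, so any difference
between the two would show up around the 16th significant digit, not the 3rd
decimal place.

```java
class TrigCheck {
    // One line per function: decimal and hex form of both results.
    static String describe(String name, double math, double strict) {
        return String.format("%-5s Math=%s (%s)  StrictMath=%s (%s)",
                name, Double.toString(math), Double.toHexString(math),
                Double.toString(strict), Double.toHexString(strict));
    }

    // Distance between the two results, measured in ulps of the StrictMath value.
    static double ulpsApart(double math, double strict) {
        return Math.abs(math - strict) / Math.ulp(strict);
    }

    public static void main(String[] args) {
        // Hypothetical latitude in degrees, converted the same way as in the formula.
        double x = 40.8154 * (Math.atan(1) / 45);

        System.out.println(describe("atan", Math.atan(1), StrictMath.atan(1)));
        System.out.println(describe("sin", Math.sin(x), StrictMath.sin(x)));
        System.out.println(describe("cos", Math.cos(x), StrictMath.cos(x)));
        // Feed cos(x) back into acos so the argument is guaranteed to be in [-1, 1].
        double c = Math.cos(x);
        System.out.println(describe("acos", Math.acos(c), StrictMath.acos(c)));
        System.out.println("sin ulps apart: " + ulpsApart(Math.sin(x), StrictMath.sin(x)));
    }
}
```

Running the same four calls in Hive (for example in a SELECT against a
one-row table) and comparing the printed digits would show whether any single
function disagrees, or whether the inputs themselves differ.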
>>
>> Thanks,
>> PD.
>>
>>
>

--e89a8fb1f3be1e765704d04b40b3--