Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1EBCAD461 for ; Fri, 7 Dec 2012 23:05:51 +0000 (UTC) Received: (qmail 75815 invoked by uid 500); 7 Dec 2012 23:05:49 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 75778 invoked by uid 500); 7 Dec 2012 23:05:49 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 75769 invoked by uid 99); 7 Dec 2012 23:05:49 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 07 Dec 2012 23:05:49 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of periya.data@gmail.com designates 209.85.219.48 as permitted sender) Received: from [209.85.219.48] (HELO mail-oa0-f48.google.com) (209.85.219.48) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 07 Dec 2012 23:05:44 +0000 Received: by mail-oa0-f48.google.com with SMTP id h2so1141233oag.35 for ; Fri, 07 Dec 2012 15:05:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=NH9IKLA+v69fhLJwcmldO7D0kmLhsnLtsS4IY8dbVtQ=; b=VPS/2G4iwdbqqPuGPOyuACLbMlvX4ZSLQeeVXkUWSfdDJK3By/c8AAU3Uir+LzIWRS h+CV7scQK8ZCWl8kscrAAVB9lRHL8R61m6XfmHgHoL+rZKnB/hVBrTmVyXGiy24Xtmnb qmM3YfVzDyelJYt3SHMH6LKTM22PNbuRD76tM9mwPYYji3sJXAqU6XLlpCKpZutfKbH0 3RXctNcsqwSHXP8Iz6xAXrwwUYGEfSAKL9VSU4PaGQjVd/EEYotqyJh21LoiHLIofjA1 0RchkKhEJROLT08FHYPuTH8bv6tT84PD+YrUNmFUD05JCeECS7Et8M4WqVLJ1xGxkI3h M3Mw== MIME-Version: 1.0 Received: by 10.60.0.199 with SMTP id 7mr4139260oeg.139.1354921523705; Fri, 07 Dec 2012 15:05:23 -0800 (PST) Received: by 10.60.161.233 with HTTP; Fri, 7 Dec 2012 15:05:23 -0800 (PST) In-Reply-To: References: Date: Fri, 7 Dec 2012 15:05:23 -0800 Message-ID: Subject: Re: Hive double-precision question From: "Periya.Data" To: user@hive.apache.org Cc: cdh-user@cloudera.org Content-Type: multipart/alternative; boundary=e89a8fb1f3be1e765704d04b40b3 X-Virus-Checked: Checked by ClamAV on apache.org --e89a8fb1f3be1e765704d04b40b3 Content-Type: text/plain; charset=ISO-8859-1 Hi Mark, Thanks for the pointers. I looked at the code and it looks like my Java code and the Hive code are similar...(I am a basic-level Java guy). The UDF below uses Math.sin....which is what I used to test "linux + Java" result. I have to see what this DoubleWritable and Serde2 is all about... package org.apache.hadoop.hive.ql.udf; import org.apache.hadoop.hive.ql.exec.Description; import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.hive.serde2.io.DoubleWritable; /** * UDFSin. * */ @Description(name = "sin", value = "_FUNC_(x) - returns the sine of x (x is in radians)", extended = "Example:\n " + " > SELECT _FUNC_(0) FROM src LIMIT 1;\n" + " 0") public class UDFSin extends UDF { private DoubleWritable result = new DoubleWritable(); public UDFSin() { } public DoubleWritable evaluate(DoubleWritable a) { if (a == null) { return null; } else { result.set(Math.sin(a.get())); return result; } } } On Fri, Dec 7, 2012 at 2:02 PM, Mark Grover wrote: > Periya: > If you want to see what the built in Hive UDFs are doing, the code is here: > > https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic > and > > https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf > > You can find out which UDF name maps to what class by looking at > https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java > > If my memory serves me right, there was some "interesting" stuff Hive does > when mapping Java types to Hive datatypes. I am not sure how relevant it is > to this discussion but I will have to look further to comment more. > > In the meanwhile take a look at the UDF code and see if your personal Java > code on Linux is equivalent to the Hive UDF code. > > Keep us posted! > Mark > > On Fri, Dec 7, 2012 at 1:27 PM, Periya.Data wrote: > >> Hi Hive Users, >> I recently noticed an interesting behavior with Hive and I am unable >> to find the reason for it. Your insights into this is much appreciated. >> >> I am trying to compute the distance between two zip codes. I have the >> distances computed in various 'platforms' - SAS, R, Linux+Java, Hive UDF >> and using Hive's built-in functions. There are some discrepancies from the >> 3rd decimal place when I see the output got from using Hive UDF and Hive's >> built-in functions. Here is an example: >> >> zip1 zip 2 Hadoop Built-in function >> SAS R Linux + >> Java >> 00501 11720 4.49493083698542000 4.49508858 4.49508858054005 >> 4.49508857976933000 >> The formula used to compute distance is this (UDF): >> >> double long1 = Math.atan(1)/45 * ux; >> double lat1 = Math.atan(1)/45 * uy; >> double long2 = Math.atan(1)/45 * mx; >> double lat2 = Math.atan(1)/45 * my; >> >> double X1 = long1; >> double Y1 = lat1; >> double X2 = long2; >> double Y2 = lat2; >> >> double distance = 3949.99 * Math.acos(Math.sin(Y1) * >> Math.sin(Y2) + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1 >> - X2)); >> >> >> The one used using built-in functions (same as above): >> 3949.99*acos( sin(u_y_coord * (atan(1)/45 )) * >> sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (atan(1)/45 ))* >> cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord * >> (atan(1)/45) - m_x_coord * (atan(1)/45)) ) >> >> >> >> >> - The Hive's built-in functions used are acos, sin, cos and atan. >> - for another try, I used Hive UDF, with Java's math library (Math.acos, >> Math.atan etc) >> - All variables used are double. >> >> I expected the value from Hadoop UDF (and Built-in functions) to be >> identical with that got from plain Java code in Linux. But they are not. >> The built-in function (as well as UDF) gives 49493083698542000 whereas >> simple Java program running in Linux gives 49508857976933000. The linux >> machine is similar to the Hadoop cluster machines. >> >> Linux version - Red Hat 5.5 >> Java - latest. >> Hive - 0.7.1 >> Hadoop - 0.20.2 >> >> This discrepancy is very consistent across thousands of zip-code >> distances. It is not a one-off occurrence. In some cases, I see the >> difference from the 4th decimal place. Some more examples: >> >> zip1 zip 2 Hadoop Built-in function >> SAS R Linux + >> Java >> 00602 00617 42.79095253903410000 42.79072812 42.79072812185650 >> 42.79072812185640000 00603 00617 40.24044016655180000 40.2402289 >> 40.24022889740920 40.24022889740910000 00605 00617 >> 40.19191761288380000 40.19186416 40.19186415807060 40.19186415807060000 >> I have not tested the individual sin, cos, atan function returns. That >> will be my next test. But, at the very least, why is there a difference in >> the values between Hadoop's UDF/built-ins and that from Linux + Java? I am >> assuming that Hive's built-in mathematical functions are nothing but the >> underlying Java functions. >> >> Thanks, >> PD. >> >> > --e89a8fb1f3be1e765704d04b40b3 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hi Mark,
=A0=A0 Thanks for the pointers. I looked at the code and it loo= ks like my Java code and the Hive code are similar...(I am a basic-level Ja= va guy). The UDF below uses Math.sin....which is what I used to test "= linux + Java" result.=A0 I have to see what this DoubleWritable and Se= rde2 is all about...

package org.apache.hadoop.hive.ql.udf;

import org.apache.hadoop.= hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
= import org.apache.hadoop.hive.serde2.io.DoubleWritable;

/**
* UDF= Sin.
*
*/
@Description(name =3D "sin",
=A0=A0=A0 value =3D &q= uot;_FUNC_(x) - returns the sine of x (x is in radians)",
=A0=A0=A0= extended =3D "Example:\n "
=A0=A0=A0 + " > SELECT _FU= NC_(0) FROM src LIMIT 1;\n" + " 0")
public class UDFSin extends UDF {
=A0 private DoubleWritable result =3D = new DoubleWritable();

=A0 public UDFSin() {
=A0 }



= public DoubleWritable evaluate(DoubleWritable a) {
=A0=A0=A0 if (a =3D= =3D null) {
=A0=A0=A0=A0=A0 return null;
=A0=A0=A0 } else {
=A0=A0=A0=A0=A0 result.set(Math.sin(a.get()));
=A0= =A0=A0=A0=A0 return result;
=A0=A0=A0 }
=A0 }
}






On Fri, Dec 7, 2012 at 2:02 PM, Mar= k Grover <grover.markgrover@gmail.com> wrote:
Periya:
If you want to see what the buil= t in Hive UDFs are doing, the code is here:


If my memory serves me right, there was some "inte= resting" stuff Hive does when mapping Java types to Hive datatypes. I = am not sure how relevant it is to this discussion but I will have to look f= urther to comment more.

In the meanwhile take a look at the UDF code and see if= your personal Java code on Linux is equivalent to the Hive UDF code.
=

Keep us posted!
Mark

On Fri, Dec 7, 2012 at 1:27 PM, Periya.Data <periya.data@gmail.com> wrote:
Hi Hive Users,
=A0=A0=A0 I recently noticed an interesting behavior with= Hive and I am unable to find the reason for it. Your insights into this is= much appreciated.

I am trying to compute the distance between two z= ip codes. I have the distances computed in various 'platforms' - SA= S, R, Linux+Java, Hive UDF and using Hive's built-in functions. There a= re some discrepancies from the 3rd decimal place when I see the output got = from using Hive UDF and Hive's built-in functions. Here is an example:<= br>
zip1=A0=A0=A0=A0=A0=A0=A0=A0=A0 zip 2=A0=A0=A0=A0=A0=A0=A0=A0=A0 Hadoop= Built-in function=A0=A0=A0 SAS=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0 R=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 Linux + Jav= a
=
00= 501=A0=A0 11720=A0=A0 4.49493083698542000 4.49508858 4.49508858054005 4.49508857976933000

The formula used to compute distance is this (UDF)= :

=A0=A0=A0 =A0=A0=A0 double long1 =3D Math.atan(1)/45 * ux;
=A0= =A0=A0 =A0=A0=A0 double lat1 =3D Math.atan(1)/45 * uy;
=A0=A0=A0 =A0=A0= =A0 double long2 =3D Math.atan(1)/45 * mx;
=A0=A0=A0 =A0=A0=A0 double lat2 =3D Math.atan(1)/45 * my;
=A0=A0=A0 =A0= =A0=A0
=A0=A0=A0 =A0=A0=A0 double X1 =3D long1;
=A0=A0=A0 =A0=A0=A0 = double Y1 =3D lat1;
=A0=A0=A0 =A0=A0=A0 double X2 =3D long2;
=A0=A0= =A0 =A0=A0=A0 double Y2 =3D lat2;
=A0=A0=A0 =A0=A0=A0
=A0=A0=A0 =A0= =A0=A0 double distance =3D 3949.99 * Math.acos(Math.sin(Y1) *
=A0=A0=A0 =A0=A0=A0 =A0=A0=A0 =A0=A0=A0 Math.sin(Y2) + Math.cos(Y1) * Math.= cos(Y2) * Math.cos(X1 - X2));


The one used using built-in functi= ons (same as above):
3949.99*acos(=A0 sin(u_y_coord * (atan(1)/45 )) * <= br>=A0=A0=A0 =A0=A0=A0 sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (at= an(1)/45 ))*
=A0=A0=A0 =A0=A0=A0 cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord *
=A0= =A0=A0 =A0=A0=A0 (atan(1)/45) - m_x_coord * (atan(1)/45)) )



=
- The Hive's built-in functions used are acos, sin, cos and atan.- for another try, I used Hive UDF, with Java's math library (Math.ac= os, Math.atan etc)
- All variables used are double.

I expected the value from Hadoop UD= F (and Built-in functions) to be identical with that got from plain Java co= de in Linux. But they are not. The built-in function (as well as UDF) gives= 49493083698542000 whereas simple Java program running in Linux gives 49508= 857976933000. The linux machine is similar to the Hadoop cluster machines.<= br>
Linux version - Red Hat 5.5
Java - latest.
Hive - 0.7.1
Hadoop= - 0.20.2

This discrepancy is very consistent across thousands of zi= p-code distances. It is not a one-off occurrence. In some cases, I see the = difference from the 4th decimal place. Some more examples:

zip1=A0=A0=A0=A0=A0=A0=A0=A0=A0 zip 2=A0=A0=A0=A0=A0=A0=A0=A0=A0 Hadoop= Built-in function=A0=A0=A0=20 SAS=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 R=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 Linux + Java
00= 602=A0=A0 00617=A0=A0 42.79095253903410000 42.79072812 42.79072812185650 42.79072812185640000
00603=A0=A0 00617=A0=A0 40.24044016655180000 40.2402289 40.24022889740920 40.24022889740910000
00605=A0=A0 00617=A0=A0 40.19191761288380000 = 40.19186416 40.19186415807060 40.19186415807060000

I have not tested the individual sin, cos, atan functio= n returns. That will be my next test. But, at the very least, why is there = a difference in the values between Hadoop's UDF/built-ins and that from= Linux + Java?=A0 I am assuming that Hive's built-in mathematical funct= ions are nothing but the underlying Java functions.

Thanks,
PD.



--e89a8fb1f3be1e765704d04b40b3--