Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 9EF5697CD for ; Tue, 7 Aug 2012 18:08:48 +0000 (UTC) Received: (qmail 91453 invoked by uid 500); 7 Aug 2012 18:08:46 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 91370 invoked by uid 500); 7 Aug 2012 18:08:46 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 91353 invoked by uid 99); 7 Aug 2012 18:08:46 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 07 Aug 2012 18:08:46 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of jamalraihan@gmail.com designates 209.85.212.48 as permitted sender) Received: from [209.85.212.48] (HELO mail-vb0-f48.google.com) (209.85.212.48) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 07 Aug 2012 18:08:38 +0000 Received: by vbjk17 with SMTP id k17so4828888vbj.35 for ; Tue, 07 Aug 2012 11:08:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=pRY1wsrXs+eWIH1OBwvwaQbwKjjocMN63LOj8vlLQeM=; b=mNZ1Ic6+tvYiJHU9ejvnpP29gyozDwJmkod7OAkEBDTrXUGMUE2zAk6y5xWlirTkv6 HD4BCIwA6ExnMWWVPlqRqoB3kQ9O8XFeLwpbtQmiN26zX5Y5TpzR60w8tRNnO7F+0KW7 LdtXMihc7ZHHSrK7Xw1uuyYCXoUn+jCAoIJR/nWadn+KXm/I3qgBMj+mvvia5tZLRKso VYmV31tYTpg9QZk/eMFUtMOCBz3GL//B9zPK/O66Y56C8WR+6jlbpsogSIx71t0squ08 eByprHpWAks4nBaR0EpsbRYIlVTCMWwyL6rVW/v33LFSb/d2PK4mjCRCDLN5vLCiGzjr CvOg== Received: by 10.58.182.35 with SMTP id eb3mr13281706vec.42.1344362897288; Tue, 07 Aug 2012 11:08:17 -0700 (PDT) MIME-Version: 1.0 Received: by 10.58.249.33 with HTTP; Tue, 7 Aug 2012 11:07:57 -0700 (PDT) In-Reply-To: References: From: Raihan Jamal Date: Tue, 7 Aug 2012 11:07:57 -0700 Message-ID: Subject: Re: Custom UserDefinedFunction in Hive To: user@hive.apache.org Cc: dev@hive.apache.org Content-Type: multipart/alternative; boundary=047d7b6d9792f12ddf04c6b0e0b0 --047d7b6d9792f12ddf04c6b0e0b0 Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable Hi Jan, I have date in different format also, so that is the reason I was thinking to do by this approach. How can I make sure this will work on the selected partition only and it will not scan the entire table. I will add your suggestion in my UDF as deterministic thing. My simple question here is- How to get the Yesterdays date which I can use on the Date Partition I cannot use hiveconf here as I am working with Hive 0.6 *Raihan Jamal* On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolin=E1r wrote: > I'm afraid that he query > > SELECT * FROM REALTIME where dt=3D yesterdaydate('yyyyMMdd') LIMIT 10; > > will scan entire table, because the functions is evaluated at runtime, so > Hive doesn't know what the value is when it decides which files to scan. = I > am not 100% sure though, you should try it. > > Also, you might want to try to add annotation to your UDF saying that the > function is deterministic: > @*UDFType(deterministic=3Dfalse*) > > I think Hive might be clever enough to evaluate it early enough to use th= e > partition pruning correctly, since it operates on constant expression. Bu= t > again, I'm not really sure, maybe someone with deeper knowledge of Hive > optimizations will tell us more. It is actually quite interesting questio= n. > > Another way to help Hive with the optimizations might be to skip passing > the format string argument, if you have all dates in same format, you can > call the function just like 'yesterdaydate()' and hardcode the format in > the function. > > Jan > > > On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal wrote= : > >> Hi Jan, >> >> >> I figured that out, it is working fine for me now. The only question I >> have is, if I am doing like this- >> >> >> >> SELECT * FROM REALTIME where dt=3D yesterdaydate('yyyyMMdd') LIMIT 10; >> >> >> >> Then the above query will be evaluated as below right? >> >> >> >> SELECT * FROM REALTIME where dt=3D =9120120806=92 LIMIT 10; >> >> >> >> So that means it will look for data in the corresponding dt partition *(= 20120806) >> *only right as above table is partitioned on dt column ? And it will not >> scan the whole table right?** >> >> >> >> *Raihan Jamal* >> >> >> >> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolin=E1r wro= te: >> >>> Hi Jamal, >>> >>> Check if the function really returns what it should and that your data >>> are really in yyyyMMdd format. You can do this by simple query like thi= s: >>> >>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; >>> >>> I don't see anything wrong with the function itself, it works well for >>> me (although I tested it in hive 0.7.1). The only thing I would change >>> about it would be to optimize it by calling 'new' only at the time of >>> construction and reusing the object when the function is called, but th= at >>> should not affect the functionality at all. >>> >>> Best regards, >>> Jan >>> >>> >>> >>> >>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal wro= te: >>> >>>> *Problem* >>>> >>>> I created the below UserDefinedFunction to get the yesterday's day in >>>> the format I wanted as I will be passing the format into this below me= thod >>>> from the query. >>>> >>>> >>>> >>>> *public final class YesterdayDate extends UDF {* >>>> >>>> * * >>>> >>>> * public String evaluate(final String format) { * >>>> >>>> * DateFormat dateFormat =3D new >>>> SimpleDateFormat(format); * >>>> >>>> * Calendar cal =3D Calendar.getInstance= (); >>>> * >>>> >>>> * cal.add(Calendar.DATE, -1); * >>>> >>>> * return >>>> dateFormat.format(cal.getTime()).toString(); * >>>> >>>> * } * >>>> >>>> *}* >>>> >>>> >>>> >>>> >>>> >>>> So whenever I try to run the query like below by adding the jar to >>>> classpath and creating the temporary function yesterdaydate, I always = get >>>> zero result back- >>>> >>>> >>>> >>>> hive> create temporary function *yesterdaydate* as >>>> 'com.example.hive.udf.YesterdayDate'; >>>> >>>> OK >>>> >>>> Time taken: 0.512 seconds >>>> >>>> >>>> >>>> Below is the query I am running- >>>> >>>> >>>> >>>> *hive> SELECT * FROM REALTIME where dt=3D yesterdaydate('yyyyMMdd') >>>> LIMIT 10;* >>>> >>>> *OK* >>>> >>>> * * >>>> >>>> And I always get zero result back but the data is there in that table >>>> for Aug 5th.** >>>> >>>> >>>> >>>> What wrong I am doing? Any suggestions will be appreciated. >>>> >>>> >>>> >>>> >>>> >>>> NOTE:- As I am working with Hive 0.6 so it doesn=92t support variable >>>> substitution thing, so I cannot use hiveconf here and the above table = has >>>> been partitioned on dt(date) column.** >>>> >>> >>> >> > --047d7b6d9792f12ddf04c6b0e0b0 Content-Type: text/html; charset=windows-1252 Content-Transfer-Encoding: quoted-printable
Hi Jan,

I have date in different format= also, so that is the reason I was thinking to do by this approach. How can= I make sure this will work on the selected partition only and it will not = scan the entire table. I will add your suggestion in my UDF as deterministi= c thing.

My simple question here is- How to get the Yesterdays d= ate which I can use on the Date Partition I cannot use hiveconf here as I a= m working with Hive 0.6




Raihan Jamal



On Tue, Aug 7, 2012 at 10:37 AM, Jan Dol= in=E1r <dolik.rce@gmail.com> wrote:
I'm afraid that =A0he query

SELECT * FROM REALTIME where d= t=3D yesterdaydate('yyyyMMdd') LIMIT 10;

will scan entire table, because the functions is evaluated a= t runtime, so Hive doesn't know what the value is when it decides which= files to scan. I am not 100% sure though, you should try it.=A0

Also, you might want to try to add annotation to your UDF saying= that the function is deterministic:
@UDFType(deterministic=3Dfalse<= /em>)<= /div>

I think Hive might be clever enough to evaluate it earl= y enough to use the partition pruning correctly, since it operates on const= ant expression. But again, I'm not really sure, maybe someone with deep= er knowledge of Hive optimizations will tell us more. It is actually quite = interesting question.

Another way to help Hive with the optimizations might b= e to skip passing the format string argument, if you have all dates in same= format, you can call the function just like 'yesterdaydate()' and = hardcode the format in the function.

Jan

=

On Tue, Aug 7, 2012 at 7:20 PM, R= aihan Jamal <jamalraihan@gmail.com> wrote:

Hi Jan,

=A0

I figured that out, it is working fine for me now.=A0The only question I have is, if I am doing like this-

=A0

SELECT * FROM REALTIME wh= ere dt=3D yesterdaydate('yyyyMMdd') LIMIT 10;

=A0

Then the above quer= y will be evaluated as below right?

=A0

SELECT * FROM REALTIME wh= ere dt=3D =9120120806=92 LIMIT 10;

=A0

So that means it will loo= k for data in the corresponding dt partition (20120806) only right as above table is partitioned on dt = column ? And it will not scan the whole table right?




Raihan Jamal



On Mon, Aug 6, 2012 a= t 10:56 PM, Jan Dolin=E1r <dolik.rce@gmail.com> wrote:
Hi Jamal,

Check if the function really returns what it s= hould and that your data are really in yyyyMMdd format. You can do this by = simple query like this:

SELECT dt,=A0yesterdaydate('yyyyM= Mdd')=A0FROM REALTIME LIMIT 1;
I don't see anything wrong with the function itself= , it works well for me (although I tested it in hive 0.7.1). The only thing= I would change about it would be to optimize it by calling 'new' o= nly at the time of construction and reusing the object when the function is= called, but that should not affect the functionality at all.=
Best regards,
Ja= n




On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <<= a href=3D"mailto:jamalraihan@gmail.com" target=3D"_blank">jamalraihan@gmail= .com> wrote:

Problem

I created the below UserD= efinedFunction to get the yesterday's day in the format I wanted as I will be passing the format into this below method from the query.

=A0

public final class Yes= terdayDate extends UDF {

=A0

=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 public String evaluate(final String format) {

=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 DateFormat dateFormat =3D new SimpleDateFormat(format);

=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 Calendar cal =3D Calendar.getInstance();

=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 cal.add(Calendar.DATE, -1);=A0=A0=A0=A0

=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 return dateFormat.format(cal.getTime()).toString();

=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0=A0 }

}

=A0

=A0

So whenever I try to run = the query like below by adding the jar to classpath and creating the temporary function yesterdaydate, I always ge= t zero result back-

=A0

hive> create temporary= function yesterdaydate as 'com.example.hive.udf.YesterdayDate';

OK

Time taken: 0.512 seconds=

=A0

Below is the query I am r= unning-

=A0

hive> SELECT * FROM= REALTIME where dt=3D yesterdaydate('yyyyMMdd') LIMIT 10;

OK

=A0

And I always get zero res= ult back but the data is there in that table for Aug 5th.

=A0

What wrong I am doing? An= y suggestions will be appreciated.

=A0

=A0

NOTE:- As I am working wi= th Hive 0.6 so it doesn=92t support variable substitution thing, so I cannot use hiveconf here and the above ta= ble has been partitioned on dt(date) column.





--047d7b6d9792f12ddf04c6b0e0b0--