hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Raihan Jamal <jamalrai...@gmail.com>
Subject Re: Custom UserDefinedFunction in Hive
Date Tue, 07 Aug 2012 18:24:06 GMT
Yes it supports -e option, but in your query what is date?

hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
AS 'com.example.hive.udf.YesterdayDate';
SELECT * FROM REALTIME where dt=$(*date* -d -1day +%Y%m%d) LIMIT 10;"



*Raihan Jamal*



On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár <dolik.rce@gmail.com> wrote:

> By the way, even without hiveconf, you can run hive from shell like this
> to achieve what you want using shell capabilities:
>
> hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
> AS 'com.example.hive.udf.YesterdayDate';
> SELECT * FROM REALTIME where dt=$(date -d -1day +%Y%m%d) LIMIT 10;"
>
> At least if hive 6.0 supports -e option, I don't have where to check that.
>
> Jan
>
>
> On Tue, Aug 7, 2012 at 8:11 PM, Vijay <techvd@gmail.com> wrote:
>
>> Given the implementation of the UDF, I don't think hive would be able
>> to use partition pruning. Especially the version you're using. I'd
>> really recommend upgrading to a later version that has the hiveconf
>> support. That can save a lot of trouble rather than trying to get
>> things working on 0.6
>>
>> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <jamalraihan@gmail.com>
>> wrote:
>> > Hi Jan,
>> >
>> > I have date in different format also, so that is the reason I was
>> thinking
>> > to do by this approach. How can I make sure this will work on the
>> selected
>> > partition only and it will not scan the entire table. I will add your
>> > suggestion in my UDF as deterministic thing.
>> >
>> > My simple question here is- How to get the Yesterdays date which I can
>> use
>> > on the Date Partition I cannot use hiveconf here as I am working with
>> Hive
>> > 0.6
>> >
>> >
>> >
>> >
>> > Raihan Jamal
>> >
>> >
>> >
>> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <dolik.rce@gmail.com>
>> wrote:
>> >>
>> >> I'm afraid that  he query
>> >>
>> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>> >>
>> >> will scan entire table, because the functions is evaluated at runtime,
>> so
>> >> Hive doesn't know what the value is when it decides which files to
>> scan. I
>> >> am not 100% sure though, you should try it.
>> >>
>> >> Also, you might want to try to add annotation to your UDF saying that
>> the
>> >> function is deterministic:
>> >> @UDFType(deterministic=false)
>> >>
>> >> I think Hive might be clever enough to evaluate it early enough to use
>> the
>> >> partition pruning correctly, since it operates on constant expression.
>> But
>> >> again, I'm not really sure, maybe someone with deeper knowledge of Hive
>> >> optimizations will tell us more. It is actually quite interesting
>> question.
>> >>
>> >> Another way to help Hive with the optimizations might be to skip
>> passing
>> >> the format string argument, if you have all dates in same format, you
>> can
>> >> call the function just like 'yesterdaydate()' and hardcode the format
>> in the
>> >> function.
>> >>
>> >> Jan
>> >>
>> >>
>> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <jamalraihan@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi Jan,
>> >>>
>> >>>
>> >>>
>> >>> I figured that out, it is working fine for me now. The only question
I
>> >>> have is, if I am doing like this-
>> >>>
>> >>>
>> >>>
>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>> >>>
>> >>>
>> >>>
>> >>> Then the above query will be evaluated as below right?
>> >>>
>> >>>
>> >>>
>> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>> >>>
>> >>>
>> >>>
>> >>> So that means it will look for data in the corresponding dt partition
>> >>> (20120806) only right as above table is partitioned on dt column ?
>> And it
>> >>> will not scan the whole table right?
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> Raihan Jamal
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <dolik.rce@gmail.com>
>> wrote:
>> >>>>
>> >>>> Hi Jamal,
>> >>>>
>> >>>> Check if the function really returns what it should and that your
>> data
>> >>>> are really in yyyyMMdd format. You can do this by simple query like
>> this:
>> >>>>
>> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>> >>>>
>> >>>> I don't see anything wrong with the function itself, it works well
>> for
>> >>>> me (although I tested it in hive 0.7.1). The only thing I would
>> change about
>> >>>> it would be to optimize it by calling 'new' only at the time of
>> construction
>> >>>> and reusing the object when the function is called, but that should
>> not
>> >>>> affect the functionality at all.
>> >>>>
>> >>>> Best regards,
>> >>>> Jan
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <jamalraihan@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> Problem
>> >>>>>
>> >>>>> I created the below UserDefinedFunction to get the yesterday's
day
>> in
>> >>>>> the format I wanted as I will be passing the format into this
below
>> method
>> >>>>> from the query.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> public final class YesterdayDate extends UDF {
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>                 public String evaluate(final String format)
{
>> >>>>>
>> >>>>>                                 DateFormat dateFormat = new
>> >>>>> SimpleDateFormat(format);
>> >>>>>
>> >>>>>                                 Calendar cal =
>> Calendar.getInstance();
>> >>>>>
>> >>>>>                                 cal.add(Calendar.DATE, -1);
>> >>>>>
>> >>>>>                                 return
>> >>>>> dateFormat.format(cal.getTime()).toString();
>> >>>>>
>> >>>>>                 }
>> >>>>>
>> >>>>> }
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> So whenever I try to run the query like below by adding the
jar to
>> >>>>> classpath and creating the temporary function yesterdaydate,
I
>> always get
>> >>>>> zero result back-
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> hive> create temporary function yesterdaydate as
>> >>>>> 'com.example.hive.udf.YesterdayDate';
>> >>>>>
>> >>>>> OK
>> >>>>>
>> >>>>> Time taken: 0.512 seconds
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> Below is the query I am running-
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>> LIMIT
>> >>>>> 10;
>> >>>>>
>> >>>>> OK
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> And I always get zero result back but the data is there in that
>> table
>> >>>>> for Aug 5th.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> What wrong I am doing? Any suggestions will be appreciated.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support
variable
>> >>>>> substitution thing, so I cannot use hiveconf here and the above
>> table has
>> >>>>> been partitioned on dt(date) column.
>> >>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>
>

Mime
View raw message