hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Raihan Jamal <jamalrai...@gmail.com>
Subject Re: Custom UserDefinedFunction in Hive
Date Tue, 07 Aug 2012 18:26:14 GMT
Hi Vijay,

Thanks for the suggestion,  If upgrading to Hive was under my control then
I would have done for sure, but I am working in a company and they are
running Hive 0.6 on all the cluster, And I told them to upgrade the Hive
version but they said it will take few months for them to do this. And I
don't know why they are saying like this, so that is the reason I was doing
like this.

Any suggestions will be appreciated to make this thing work


*Raihan Jamal*



On Tue, Aug 7, 2012 at 11:11 AM, Vijay <techvd@gmail.com> wrote:

> Given the implementation of the UDF, I don't think hive would be able
> to use partition pruning. Especially the version you're using. I'd
> really recommend upgrading to a later version that has the hiveconf
> support. That can save a lot of trouble rather than trying to get
> things working on 0.6
>
> On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <jamalraihan@gmail.com>
> wrote:
> > Hi Jan,
> >
> > I have date in different format also, so that is the reason I was
> thinking
> > to do by this approach. How can I make sure this will work on the
> selected
> > partition only and it will not scan the entire table. I will add your
> > suggestion in my UDF as deterministic thing.
> >
> > My simple question here is- How to get the Yesterdays date which I can
> use
> > on the Date Partition I cannot use hiveconf here as I am working with
> Hive
> > 0.6
> >
> >
> >
> >
> > Raihan Jamal
> >
> >
> >
> > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <dolik.rce@gmail.com>
> wrote:
> >>
> >> I'm afraid that  he query
> >>
> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
> >>
> >> will scan entire table, because the functions is evaluated at runtime,
> so
> >> Hive doesn't know what the value is when it decides which files to
> scan. I
> >> am not 100% sure though, you should try it.
> >>
> >> Also, you might want to try to add annotation to your UDF saying that
> the
> >> function is deterministic:
> >> @UDFType(deterministic=false)
> >>
> >> I think Hive might be clever enough to evaluate it early enough to use
> the
> >> partition pruning correctly, since it operates on constant expression.
> But
> >> again, I'm not really sure, maybe someone with deeper knowledge of Hive
> >> optimizations will tell us more. It is actually quite interesting
> question.
> >>
> >> Another way to help Hive with the optimizations might be to skip passing
> >> the format string argument, if you have all dates in same format, you
> can
> >> call the function just like 'yesterdaydate()' and hardcode the format
> in the
> >> function.
> >>
> >> Jan
> >>
> >>
> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <jamalraihan@gmail.com>
> >> wrote:
> >>>
> >>> Hi Jan,
> >>>
> >>>
> >>>
> >>> I figured that out, it is working fine for me now. The only question I
> >>> have is, if I am doing like this-
> >>>
> >>>
> >>>
> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
> >>>
> >>>
> >>>
> >>> Then the above query will be evaluated as below right?
> >>>
> >>>
> >>>
> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
> >>>
> >>>
> >>>
> >>> So that means it will look for data in the corresponding dt partition
> >>> (20120806) only right as above table is partitioned on dt column ? And
> it
> >>> will not scan the whole table right?
> >>>
> >>>
> >>>
> >>>
> >>> Raihan Jamal
> >>>
> >>>
> >>>
> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <dolik.rce@gmail.com>
> wrote:
> >>>>
> >>>> Hi Jamal,
> >>>>
> >>>> Check if the function really returns what it should and that your data
> >>>> are really in yyyyMMdd format. You can do this by simple query like
> this:
> >>>>
> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
> >>>>
> >>>> I don't see anything wrong with the function itself, it works well for
> >>>> me (although I tested it in hive 0.7.1). The only thing I would
> change about
> >>>> it would be to optimize it by calling 'new' only at the time of
> construction
> >>>> and reusing the object when the function is called, but that should
> not
> >>>> affect the functionality at all.
> >>>>
> >>>> Best regards,
> >>>> Jan
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <jamalraihan@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Problem
> >>>>>
> >>>>> I created the below UserDefinedFunction to get the yesterday's day
in
> >>>>> the format I wanted as I will be passing the format into this below
> method
> >>>>> from the query.
> >>>>>
> >>>>>
> >>>>>
> >>>>> public final class YesterdayDate extends UDF {
> >>>>>
> >>>>>
> >>>>>
> >>>>>                 public String evaluate(final String format) {
> >>>>>
> >>>>>                                 DateFormat dateFormat = new
> >>>>> SimpleDateFormat(format);
> >>>>>
> >>>>>                                 Calendar cal =
> Calendar.getInstance();
> >>>>>
> >>>>>                                 cal.add(Calendar.DATE, -1);
> >>>>>
> >>>>>                                 return
> >>>>> dateFormat.format(cal.getTime()).toString();
> >>>>>
> >>>>>                 }
> >>>>>
> >>>>> }
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> So whenever I try to run the query like below by adding the jar
to
> >>>>> classpath and creating the temporary function yesterdaydate, I
> always get
> >>>>> zero result back-
> >>>>>
> >>>>>
> >>>>>
> >>>>> hive> create temporary function yesterdaydate as
> >>>>> 'com.example.hive.udf.YesterdayDate';
> >>>>>
> >>>>> OK
> >>>>>
> >>>>> Time taken: 0.512 seconds
> >>>>>
> >>>>>
> >>>>>
> >>>>> Below is the query I am running-
> >>>>>
> >>>>>
> >>>>>
> >>>>> hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
> LIMIT
> >>>>> 10;
> >>>>>
> >>>>> OK
> >>>>>
> >>>>>
> >>>>>
> >>>>> And I always get zero result back but the data is there in that
table
> >>>>> for Aug 5th.
> >>>>>
> >>>>>
> >>>>>
> >>>>> What wrong I am doing? Any suggestions will be appreciated.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
> >>>>> substitution thing, so I cannot use hiveconf here and the above
> table has
> >>>>> been partitioned on dt(date) column.
> >>>>
> >>>>
> >>>
> >>
> >
>

Mime
View raw message