hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ramkumar <ramkumar.sugava...@yahoo.com>
Subject Re: nested UDFs on Partition column
Date Tue, 01 May 2012 13:38:42 GMT
Thanks. Yes, thats what the UDF writers seemed to have intended. The way this has been done
is by using the 'deterministic' attribute in  UDFType annotation.
See 
http://javasourcecode.org/html/open-source/hive/hive-0.7.1/org/apache/hadoop/hive/ql/udf/UDFUnixTimeStamp.java.html



________________________________
 From: Philip Tromans <philip.j.tromans@gmail.com>
To: user@hive.apache.org 
Sent: Thursday, April 19, 2012 4:38 PM
Subject: Re: nested UDFs on Partition column
 
In fact, it's just not a reasonable thing to do partition pruning on.
Imagine a situation where you had:

WHERE partition_column = f(unix_timestamp()) AND ordinary_column =
f(unix_timestamp).

The right hand side of the predicate has to be evaluated at map-time,
whereas you're assuming that left hand side should be evaluated at
compile time, which means you have two different values of
unix_timestamp() floating around, which can only end badly.

Cheers,

Phil.

On 19 April 2012 16:35, Philip Tromans <philip.j.tromans@gmail.com> wrote:
> I don't know what the state of Hive's partition pruning is, but I
> would imagine that the problem is that the two example you're giving
> are fundamentally different.
>
> 1) WHERE local_date = =date_add('2011-12-07',3) ,
>
> the udf is a
 function of some constants, so the constant gets
> evaluated at compile time.
>
> 2) WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>
> unix_timestamp() is not explicitly a constant, and so the expression
> won't be simplified.
>
> I would imagine that the constant simplification code probably doesn't
> know the difference between a partition column and a real column, and
> so treats everything as a real column. If local_date wasn't a
> partition column, then there's no reasonable way of simplifying that
> predicate at compile time.
>
> Cheers,
>
> Phil.
>
> On 19 April 2012 11:50, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>> as per my understanding,
>>
>> In this case hive needs to look for all the
 partitions because it does not
>> have the value before hand on the partition check and note the udfs are
>> executed on the mapred and not on hive client side.
>>
>> I would suggest you write a hive query in a file and replace the partition
>> value with a variable
>> something like
>>
>> for partitionValue in values
>>
>>           hive $HIVEPARAMS -hiveconf  partition=$partition -e hivequery.hql
>>
>> and then in hivequery.sql you can refer the variable with
>>
>> where column_name = '${hiveconf:partition}'
>>
>> I may be wrong in interpreting the execution pattern of hivequery but this
>> approach solved my problem
>> Thanks,
>> nitin
>>
>>
>> On Thu, Apr 19, 2012 at 3:27 PM, Ramkumar <ramkumar.sugavanam@yahoo.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I have a table partitioned by local_date.  When I write a query with
>>>
>>> WHERE local_date = =date_add('2011-12-07',3) ,
>>>
>>> hive executes the UDF ahead and looks only into the specific partitions.
>>> But when the udf becomes more complex like
>>>
>>> WHERE local_date = date_sub(to_date(from_unixtime(unix_timestamp())),3),
>>>
>>> hive looks through all the partitions even though the above function  can
>>> very well be computed ahead of time and optimize the query.  Is this
>>> behaviour intentional ? And is there a workaround other than hardcoding the
>>> date or using a
 param?
>>>
>>> Thanks,
>>> Ramkumar
>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
Mime
View raw message