hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Furcy Pin <furcy....@flaminem.com>
Subject Re: Doubt on Hive Partitioning.
Date Mon, 01 Aug 2016 09:29:07 GMT
Hi Abhishek,

Yes, it can happen.

The only such scenarios I can think of are when you use a WHERE clause with
a non-constant clause.
As far as I know, partition only work on constant clauses, because it has
to evaluate them *before* starting the query in order to prune the
partitions.

For instance:

WHERE p = otherColumn
> here the predicate will depend on the row being read, thus all rows must
be read.
> if otherColumn is a partition, I don't think it work either

WHERE p IN (SELECT p FROM t2)
> here we could argue that Hive could optimize this by computing the sub
query first,
> and then do the partition pruning, but sadly I don't think this
optimisation has been implemented yet


WHERE f(p) = 'constant'
or
WHERE p = f('constant')

where f is a non-deterministic or non-stateful UDF.
An example of non-deterministic function are rand() and
unix_timestamp() because
it is evaluated differently at each row

So if you want today's partition, you should use instead current_date(),
which is deterministic,  since it takes the time of compilation of the
query.
It is only available since Hive 1.2.0 though.

You can know if a Hive UDF is deterministic and stateful by looking at the
class annotation UDFType in it's source code.
If you plan on writing your own UDF, don't forget to specifiy this
annotation as well.

hope this helps,

Furcy




On Mon, Aug 1, 2016 at 11:07 AM, Abhishek Dubey <Abhishek.Dubey@xoriant.com>
wrote:

> Hi All,
>
>
>
> I have a very big table *t* with billions of rows and it is partitioned
> on a column *p*. Column *p * has datatype text and values like ‘201601’,
> ‘201602’…upto ‘201612’.
>
> And, I am running a query like : *Select columns from t where p=’201604’.*
>
>
>
> My question is : Can there be a scenario/condition/probability that my
> query will do a complete table scan on *t* instead of only reading data
> for specified partition key. If yes, please put some light on those
> scenario.
>
>
>
> I’m asking this because someone told me that there is a probability that
> the query will ignore the partitioning and do a complete table scan to
> fetch output.
>
>
>
> *Thanks & Regards,*
> *Abhishek Dubey*
>
>
>

Mime
View raw message