Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 8F675200B51 for ; Mon, 1 Aug 2016 11:29:33 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 8DD76160A65; Mon, 1 Aug 2016 09:29:33 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id AC627160A5D for ; Mon, 1 Aug 2016 11:29:32 +0200 (CEST) Received: (qmail 69771 invoked by uid 500); 1 Aug 2016 09:29:31 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 69761 invoked by uid 99); 1 Aug 2016 09:29:31 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 01 Aug 2016 09:29:31 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 056FB1A027E for ; Mon, 1 Aug 2016 09:29:31 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.279 X-Spam-Level: * X-Spam-Status: No, score=1.279 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=flaminem-com.20150623.gappssmtp.com Received: from mx2-lw-us.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id vxgoZws0_SPa for ; Mon, 1 Aug 2016 09:29:28 +0000 (UTC) Received: from mail-wm0-f43.google.com (mail-wm0-f43.google.com [74.125.82.43]) by mx2-lw-us.apache.org (ASF Mail Server at mx2-lw-us.apache.org) with ESMTPS id 5E9F65F613 for ; Mon, 1 Aug 2016 09:29:28 +0000 (UTC) Received: by mail-wm0-f43.google.com with SMTP id q128so362165273wma.1 for ; Mon, 01 Aug 2016 02:29:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=flaminem-com.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=wu+Cw4kdHqA0dLhR/8jzD9xMXpMce8aek83D/BJBmtQ=; b=PwRIX2uEFThmKaiOtpDBuSh/rRi7fbQXrBzjeaQfHiAv8UCH86Yr1Xaa3Vz6IE667d fh8OS/EIDwXj5TS8x+Pmr0cCvDPjkdvbtG57S0J67lXWgAek/aQ54Mr7HsvDLPYu/jRC 2AediobeodvjvLpM7ysqp2vkJfF6GZ7yKceLIUYv7cVTQfaLkb+zUPxUf0LhEBCg1mFq +nMAj4Kw0H65IiRiyIEwpQkZGfpIjKC3P7DE9vhfTTSULNE/zk4MOb3UQnpxSM7amW9U 55655WDtJT/NZ7qCAyAsBz+Z2BTmjJbbY0Lgrj26ihR1A4qPTls0/Q8TJlYBZsJw9J7E sq4Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=wu+Cw4kdHqA0dLhR/8jzD9xMXpMce8aek83D/BJBmtQ=; b=DcKWWUDkJ2GvQiRZKUFxD4MwtVCMea/IcxptBOgCh5UOlDDcuNEApNHyGrvDnWOhCL ZqRFI5V9ebWnoNRdmEsV70bKD8mu6b1ArsCHFCKARo5RLRe4Ihti/4uO0I1rY09+4QlT pONqi9gXDCe1rPDwUX/H/QwyeUdM5YbMj70NSaZnhNdaQilmZEOnXolByhPYVmlNUsCQ BEjxxuv+cfpZesLhGe1X4Yp/kFgvJijRnILeTYh5CJuKmhkppW2IDeFTnK7dd596gOeq YWj3f+zOkgvUQVNYR7dA2idsXAdq4ee7ISiy1Hq7zNhAbUDKDQ/pdfeB/pcWI0vtZrKk XbSg== X-Gm-Message-State: AEkoousN4DFLiRGQPKE8caEyL8XoNWkEQ6+bD5+VsUsT0XK0xYGtkNQEtVqxmI8sSWEhhWdA8gobiFqyxC2+qUfu X-Received: by 10.194.149.113 with SMTP id tz17mr56216266wjb.64.1470043766546; Mon, 01 Aug 2016 02:29:26 -0700 (PDT) MIME-Version: 1.0 Received: by 10.28.169.138 with HTTP; Mon, 1 Aug 2016 02:29:07 -0700 (PDT) In-Reply-To: <4d6c0f1f99384965860bb0d54fe95f12@XORPUN-MBX01.India.XoriantCorp.com> References: <4d6c0f1f99384965860bb0d54fe95f12@XORPUN-MBX01.India.XoriantCorp.com> From: Furcy Pin Date: Mon, 1 Aug 2016 11:29:07 +0200 Message-ID: Subject: Re: Doubt on Hive Partitioning. To: user@hive.apache.org Content-Type: multipart/alternative; boundary=089e01160dfa827fd90538ff3cc7 archived-at: Mon, 01 Aug 2016 09:29:33 -0000 --089e01160dfa827fd90538ff3cc7 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Hi Abhishek, Yes, it can happen. The only such scenarios I can think of are when you use a WHERE clause with a non-constant clause. As far as I know, partition only work on constant clauses, because it has to evaluate them *before* starting the query in order to prune the partitions. For instance: WHERE p =3D otherColumn > here the predicate will depend on the row being read, thus all rows must be read. > if otherColumn is a partition, I don't think it work either WHERE p IN (SELECT p FROM t2) > here we could argue that Hive could optimize this by computing the sub query first, > and then do the partition pruning, but sadly I don't think this optimisation has been implemented yet WHERE f(p) =3D 'constant' or WHERE p =3D f('constant') where f is a non-deterministic or non-stateful UDF. An example of non-deterministic function are rand() and unix_timestamp() because it is evaluated differently at each row So if you want today's partition, you should use instead current_date(), which is deterministic, since it takes the time of compilation of the query. It is only available since Hive 1.2.0 though. You can know if a Hive UDF is deterministic and stateful by looking at the class annotation UDFType in it's source code. If you plan on writing your own UDF, don't forget to specifiy this annotation as well. hope this helps, Furcy On Mon, Aug 1, 2016 at 11:07 AM, Abhishek Dubey wrote: > Hi All, > > > > I have a very big table *t* with billions of rows and it is partitioned > on a column *p*. Column *p * has datatype text and values like =E2=80=982= 01601=E2=80=99, > =E2=80=98201602=E2=80=99=E2=80=A6upto =E2=80=98201612=E2=80=99. > > And, I am running a query like : *Select columns from t where p=3D=E2=80= =99201604=E2=80=99.* > > > > My question is : Can there be a scenario/condition/probability that my > query will do a complete table scan on *t* instead of only reading data > for specified partition key. If yes, please put some light on those > scenario. > > > > I=E2=80=99m asking this because someone told me that there is a probabili= ty that > the query will ignore the partitioning and do a complete table scan to > fetch output. > > > > *Thanks & Regards,* > *Abhishek Dubey* > > > --089e01160dfa827fd90538ff3cc7 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hi Abhishek,

Yes, it can happen.
<= div>
The only such scenarios I can think of are when you use = a WHERE clause with a non-constant clause.
As far as I know, part= ition only work on constant clauses, because it has to evaluate them bef= ore=C2=A0starting the query in order to prune the partitions.

For instance:

WHERE p =3D otherColumn=C2=A0
> her= e the predicate will depend on the row being read, thus all rows must be re= ad.
> if otherColumn is a partition, I don't think it work= either

WHERE p IN (SELECT p FROM t2)
> here we could= argue that Hive could optimize this by computing the sub query first,=C2= =A0
> and then do the partition pruning, but sadly I don't= think this optimisation has been implemented yet=C2=A0

<= /div>

WHERE f(p) =3D= 'constant'=C2=A0
or
WHERE p =3D f('constant')

where f is a non-deterministic or non-stateful UDF.
An ex= ample of non-deterministic function are rand() and=C2=A0unix_timestamp()=C2=A0because it is evaluated differently at each row
=

So if you wa= nt today's partition, you should use instead=C2=A0current_date(), which is deterministic, =C2=A0since i= t takes the time of compilation of the query.=C2=A0
It is only av= ailable since Hive 1.2.0 though.

You can know if a= Hive UDF is deterministic and stateful by looking at the class annotation = UDFType in it's source code.
If you plan on writing your own = UDF, don't forget to specifiy this annotation as well.

hope this helps,

Furcy




On Mon, Aug 1, 2016 at 11:07 AM, Abhishek Dubey <Abhishek.Dubey@xoriant.com> wrote:

Hi Al= l,

=C2=A0

I hav= e a very big table t with billions of rows and it is partitioned on a column p. = Column p =C2=A0has datatype text and values like =E2=80=98201601=E2=80=99, = =E2=80=98201602=E2=80=99=E2=80=A6upto =E2=80=98201612=E2=80=99.

And, = I am running a query like : Select columns from t where p=3D=E2=80=99201604=E2=80=99.<= /u>

=C2=A0

My qu= estion is : Can there be a scenario/condition/probability that my query wil= l do a complete table scan on t instead of only reading data for specified partition key. If yes, = please put some light on those scenario.

=C2=A0

I=E2=80=99m asking thi= s because someone told me that there is a probability that the query will i= gnore the partitioning and do a complete table scan to fetch output.

=C2=A0

Th= anks & Regards,
Abhishek Dubey =

=C2=A0


--089e01160dfa827fd90538ff3cc7--