Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CADQPeWxysz8fztfbr9bSY5gq8azCo2AD2Rv6vCdEkWOg9FQXJA@mail.gmail.com>
References: 
 <CAH8PO-b_gmZ1oaz9ETjionLkuuy-ZJ+Gkw0ypE7tfhWZ_B2N3Q@mail.gmail.com>
	<CALckxSNBhFgWLpSUhLYf8bErf0K24oCX7xzg0nFZKjYGb6xvKg@mail.gmail.com>
	<CAH8PO-ZuK4rgj20k2BZ5cGnWOSin6pESjfi+7_0FwnpwcVWVEA@mail.gmail.com>
	<CADQPeWxysz8fztfbr9bSY5gq8azCo2AD2Rv6vCdEkWOg9FQXJA@mail.gmail.com>
Date: Thu, 14 May 2015 17:18:04 -0700
Message-ID: 
 <CAH8PO-Y46zufRCeqy9RFvt-bt1uGh02QiLXrN7KfcTJUiCLsgA@mail.gmail.com>
Subject: Re: Partition Columns
From: Appan Thirumaligai <appanhiveug@gmail.com>
To: user@hive.apache.org
Content-Type: multipart/alternative; boundary=089e013d0f3826e670051613c752

--089e013d0f3826e670051613c752
Content-Type: text/plain; charset=UTF-8

Mungeol,

I did check the number of mappers, and it did not change between the two
queries, but when I ran a count(*) query the total execution time was
significantly lower for Query 1 than for Query 2. Also, the amount of data
the query reads does change when the where clause changes. I still can't
explain why one is faster than the other.
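
One way to check this directly, as a rough sketch: EXPLAIN DEPENDENCY
reports the inputs a query would read, including an input_partitions list,
so comparing that list for the two queries should show whether Query 2 is
being pruned. I'm assuming it behaves on our Hive 0.13 cluster the way the
Hive documentation describes; the queries are the same two from my original
mail, only reformatted:

```sql
-- Query 1: direct comparisons on the partition columns.
EXPLAIN DEPENDENCY
SELECT * FROM Sales
WHERE year = 2015 AND month = 5 AND day BETWEEN 1 AND 7;

-- Query 2: the same partition columns wrapped in functions.
EXPLAIN DEPENDENCY
SELECT * FROM Sales
WHERE concat_ws('-',
                cast(year AS string),
                lpad(cast(month AS string), 2, '0'),
                lpad(cast(day AS string), 2, '0'))
      BETWEEN '2015-01-01' AND '2015-01-07';
```

If the second statement lists every partition of Sales while the first
lists only the handful it names, the function-based predicate is not being
pruned.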

Thanks,
Appan

On Thu, May 14, 2015 at 4:46 PM, Mungeol Heo <mungeol.heo@gmail.com> wrote:

> Hi, Appan.
>
> You can simply check the amount of data your query reads from the
> table, or the number of mappers used to run that query.
> Then you can tell whether it is filtering or scanning the whole table.
> Of course, it is a lazy approach, but you can give it a try.
> I think Query 1 should work fine, because I use a lot of queries of
> that kind and they work fine for me.
>
> Thanks,
> mungeol
>
> On Fri, May 15, 2015 at 8:31 AM, Appan Thirumaligai
> <appanhiveug@gmail.com> wrote:
> > I agree with you, Viral. I see the same behavior as well. We are on
> > Hive 0.13 for the cluster where I'm testing this.
> >
> > On Thu, May 14, 2015 at 2:16 PM, Viral Bajaria <viral.bajaria@gmail.com>
> > wrote:
> >>
> >> Hi Appan,
> >>
> >> In my experience, Query 2 does not use partition pruning because it
> >> is not a straight-up filter and involves functions (aka UDFs).
> >>
> >> What version of Hive are you using?
> >>
> >> Thanks,
> >> Viral
> >>
> >>
> >>
> >> On Thu, May 14, 2015 at 1:48 PM, Appan Thirumaligai
> >> <appanhiveug@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I have a question about the Hive optimizer. I have a table with
> >>> partition columns, e.g., Sales partitioned by year, month, day.
> >>> Assume that I have two years' worth of data in this table. I'm
> >>> running two queries on this table.
> >>>
> >>> Query 1: Select * from Sales where year=2015 and month = 5 and day
> >>> between 1 and 7
> >>>
> >>> Query 2: Select * from Sales where concat_ws('-',cast(year as
> >>> string),lpad(cast(month as string),2,'0'),lpad(cast(day as
> >>> string),2,'0')) between '2015-01-01' and '2015-01-07'
> >>>
> >>> When I ran the Explain command on the above two queries, I got a
> >>> Filter operation for the second query and no Filter operation for
> >>> the first query.
> >>>
> >>> My question is: do both queries use the partitions, or are they
> >>> used only in Query 1, with Query 2 scanning all the data?
> >>>
> >>> Thanks for your help.
> >>>
> >>> Thanks,
> >>> Appan
> >>
> >>
> >
>

--089e013d0f3826e670051613c752--