Mailing-List: contact hive-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hive-user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of techvd@gmail.com designates
 209.85.221.201 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :content-type;
        b=pBpFVSx3mRTUWR+xWzf4EB5cIurE2tR2HU4MlK+tKDrVsFvt++7ov2iY7qJd5d39oz
         KxhQFO5PmvYs9qLYwO2Sy8nw/pBQYbJsn3SEJwxZ06F9Eb/itOEJsgiLraDzZ16JTCMm
         mDPyUdogcW9KrmcWsGfincTMqipdXACD1wDuE=
MIME-Version: 1.0
In-Reply-To: 
 <68B7689C98024D43B4C2709456F0B5200A21547152@SC-MBXC1.TheFacebook.com>
References: <5617ccb50910121704r25f19f4er68bb986f32254520@mail.gmail.com>
	 <68B7689C98024D43B4C2709456F0B5200A21547152@SC-MBXC1.TheFacebook.com>
Date: Thu, 15 Oct 2009 14:24:42 -0700
Message-ID: <5617ccb50910151424u7ee29e1fp766302862d223bcf@mail.gmail.com>
Subject: Re: Questions on date arithmetic/calculations
From: Vijay <techvd@gmail.com>
To: hive-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=0023543336425df3e10475ffe87a

--0023543336425df3e10475ffe87a
Content-Type: text/plain; charset=ISO-8859-1

Thanks Ashish! This is the same approach I'm using as well, and it seems to
be working very well.

One thing I wasn't sure about at first, and was later surprised by, is how
Hive figures out which partitions to work on from the WHERE clause. For
example, if I use WHERE month(ds)=9, it works out that it only needs the
partitions 2009-09-01 to 2009-09-30. How does the query engine know this?
Does it evaluate expressions on partition columns locally?

Thanks,
Vijay

On Tue, Oct 13, 2009 at 2:25 PM, Ashish Thusoo <athusoo@facebook.com> wrote:

>  We store the partitioning as
>
> YYYY-MM-DD
>
> In that format the string representation of the date has the same
> lexicographical ordering as the date itself. So if you have that as the
> format of the string in the ds column (Hive does not have date functions
> yet), then expressions of the kind
>
> ds >= '2009-08-15' and ds <= '2009-09-15'
>
> will pick up the right partitions.
>
> For doing counts over the month you can either extract the month from the
> date string using the substring(ds, 6, 2) udf in Hive (positions are
> 1-based) or you can use month(ds), and then put that in the group by
> clause of the query.
>
> Ashish
>
>  ------------------------------
> *From:* Vijay [mailto:techvd@gmail.com]
> *Sent:* Monday, October 12, 2009 5:05 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Questions on date arithmetic/calculations
>
> Hi,
>
> I have some basic questions on how Hive handles dates and date arithmetic.
> I apologize if this has already been addressed. Per most samples on this
> site and elsewhere, I can have an access log table defined with a partition
> scheme that looks like this: ds='09-08-09'. This is obviously a good way to
> partition the data. However, how can this information be used later in
> queries? For example, if I want to select data for all dates between
> 08/15/09 and 09/15/09, how would I do that? The partition column ds cannot
> be used with >= and similar operators, right? Additionally, when the table
> is partitioned this way, how can I do counts by month, etc.? Obviously all
> of these queries need to be expressed in a way Hive can still take
> advantage of the partitioning scheme. I hope that makes sense.
>
> Thanks,
> Vijay
>
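
For anyone reading this thread later, here is a small Python sketch of the
two ideas in the quoted reply: that YYYY-MM-DD strings compare in the same
order as the dates they represent, and that the month can be sliced out of
the string for grouping. This only illustrates the string property and is
not how Hive itself is implemented. The partition values and the helper
names (in_range, month_of) are made up for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical daily partition values in YYYY-MM-DD form (illustration only).
partitions = [
    "2009-08-01", "2009-08-15", "2009-08-31",
    "2009-09-15", "2009-09-16", "2009-10-01",
]


def in_range(ds, lo, hi):
    """Plain string comparison, mirroring ds >= lo and ds <= hi."""
    return lo <= ds <= hi


def month_of(ds):
    """Two-digit month sliced from a YYYY-MM-DD string (zero-based slice)."""
    return ds[5:7]


# Select partitions with a string comparison only.
selected = [ds for ds in partitions if in_range(ds, "2009-08-15", "2009-09-15")]

# Same selection with real date objects, to check the two orderings agree.
by_date = [
    ds for ds in partitions
    if date(2009, 8, 15) <= date.fromisoformat(ds) <= date(2009, 9, 15)
]

# Count the selected partitions per month, analogous to grouping by month.
counts = Counter(month_of(ds) for ds in selected)

print(selected)
print(dict(counts))
```

The string comparison only matches date order because the fields run from
most to least significant and are zero-padded; a format like MM/DD/YY would
not sort correctly as a string.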

--0023543336425df3e10475ffe87a--