Subject: Re: Hive Query Question
From: Tim Spence <yogi.wan.kenobi@gmail.com>
To: user@hive.apache.org
Date: Mon, 13 Jun 2011 12:08:29 -0700

Praveen,

My apologies--I meant to suggest a streaming function, because a UDF would
not be able to hold state either. Look at the documentation for TRANSFORM
(http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform). Your
transformation script can compare timestamps from successive lines of your
data.

Tim

On Sat, Jun 11, 2011 at 12:32 PM, Praveen wrote:
> Do you mean that my UDF would store the timestamp of the current row in a
> static field in the UDF's implementation and, when processing the next
> row, use that field to get the previous row's value?
>
> Can anyone comment on whether that is safe? I'm not familiar with Hive
> internals.
>
> Thanks,
>
> pk
>
> Sent from my iPhone
>
> On Jun 10, 2011, at 11:18 PM, Tim Spence <yogi.wan.kenobi@gmail.com> wrote:
>
> Praveen,
> This would be best accomplished with a UDF, because Hive does not support
> cursors.
> Best of luck,
> Tim
>
> On Fri, Jun 10, 2011 at 10:29 PM, Praveen Kumar <pk1u.uu@gmail.com> wrote:
>
>> If I have a table timestamps:
>>
>> hive> desc timestamps;
>> OK
>> ts      bigint
>>
>> hive> select ts from timestamps order by ts;
>> OK
>> 1
>> 2
>> 3
>> 4
>> 5
>> 6
>> 7
>> 8
>> 9
>> 10
>> 30
>> 32
>> 34
>> 36
>> 38
>> 40
>> 42
>> 44
>> 46
>> 48
>> 50
>> 70
>> 74
>> 78
>> 100
>> 105
>> 110
>> 115
>>
>> and I want to group the values, splitting between groups wherever two
>> consecutive entries differ by more than 10.
>>
>> In the data above, the values would be grouped into these ranges:
>>
>> 0-10
>> 30-50
>> 70-78
>> 100-115
>>
>> because (30 - 10), (70 - 50), and (100 - 78) are each greater than 10.
>>
>> I'd like the query to produce the following:
>>
>> hive> select ...
>>
>> 0       7
>> 0       9
>> 0       6
>> 0       3
>> 0       10
>> 0       1
>> 0       4
>> 0       5
>> 0       8
>> 0       2
>> 30      34
>> 30      44
>> 30      40
>> 30      38
>> 30      36
>> 30      32
>> 30      46
>> 30      42
>> 30      48
>> 30      50
>> 30      30
>> 70      74
>> 70      70
>> 70      78
>> 100     100
>> 100     105
>> 100     110
>> 100     115
>>
>> What is the most efficient Hive query that will do this? Thanks,
>>
>> pk
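[A sketch of the streaming approach Tim describes. Everything here is illustrative rather than from the thread: the script name sessionize.py, reading the gap threshold of 10 from a constant, and labeling each group by its first timestamp (so the first group comes out labeled 1 rather than the 0 shown in Praveen's example output) are all assumptions.]

```python
#!/usr/bin/env python
# sessionize.py -- rough sketch of the streaming script Tim suggests
# (name and details are illustrative, not from the thread).
import sys

GAP = 10  # split threshold from the original question


def sessionize(timestamps, gap=GAP):
    """Yield (group_start, ts) pairs for an ascending stream of ints.

    A new group starts whenever the gap to the previous timestamp
    exceeds `gap`; each group is labeled by its first timestamp.
    """
    group_start = None
    prev = None
    for ts in timestamps:
        if prev is None or ts - prev > gap:
            group_start = ts  # gap too large: open a new group
        prev = ts
        yield group_start, ts


if __name__ == "__main__" and not sys.stdin.isatty():
    # Hive's TRANSFORM pipes one row per line on stdin; emit
    # tab-separated (group_start, ts) pairs for Hive to read back.
    values = (int(line.strip()) for line in sys.stdin if line.strip())
    for start, ts in sessionize(values):
        sys.stdout.write("%d\t%d\n" % (start, ts))
```

[An invocation might look roughly like the following, in the thread's own hive> style. One caveat the thread does not address: the script only works if a single instance of it sees all rows in ascending order, so the rows must be globally sorted into one script instance (for example by forcing a single reducer); otherwise each parallel copy would sessionize only its own slice.]

  hive> add file sessionize.py;
  hive> select transform (ts) using 'python sessionize.py' as (grp, ts)
      > from (select ts from timestamps order by ts) sorted;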