hive-user mailing list archives

From "Butani, Harish" <harish.but...@sap.com>
Subject RE: Lag function in Hive
Date Wed, 11 Apr 2012 21:39:17 GMT
Hi Ashutosh,

Thanks for taking a look. Yes definitely open to contributing back to Hive.
Had a conversation with Carl Steinbach last week about this.
Will send you a follow up message.

Regards,
Harish

From: Ashutosh Chauhan [mailto:hashutosh@apache.org]
Sent: Wednesday, April 11, 2012 7:55 AM
To: user@hive.apache.org; Butani, Harish
Subject: Re: Lag function in Hive

Hey Harish,

Awesome work on SQL Windowing. Judging from participation on this thread, it seems windowing
is of sizable interest to the Hive community. Would you consider contributing your work upstream
to Hive? If it's in Hive contrib, it will be accessible to a lot of folks using Hive out of the box.

Thanks,
Ashutosh
On Tue, Apr 10, 2012 at 08:10, Butani, Harish <harish.butani@sap.com> wrote:
Hi Karan,

SQL Windowing with Hive (https://github.com/hbutani/SQLWindowing/wiki) may be a good fit for
your use case.

We have a lag function and you can say something like

From table
Partition by col1, col2...
Order by col1, col2,...
Select colX, <colX - lag(colX, 1)>

(There is a lag example on the wiki, and other time series examples based on the NPath table function.)

You control how rows are grouped and ordered with the partition and order clauses.
Partitions can be arbitrarily large (so you could partition by a dummy column and have all
rows in 1 partition), but this works best when there are natural partitions in your data and you
do not need to calculate across partitions.
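
To make the partition/order/lag semantics described above concrete, here is a minimal Python sketch. It is not SQL Windowing code; the function name `lag_diff` and the dummy partition column `part` are made up for illustration, and the sample rows are the ones from Karan's original question.

```python
from itertools import groupby
from operator import itemgetter


def lag_diff(rows, partition_key, order_key, value_key):
    """Add a 'result' field: value minus the previous value in the same partition."""
    out = []
    # Sort by partition, then by the ordering column, so each partition is
    # contiguous and internally ordered.
    ordered = sorted(rows, key=lambda r: (r[partition_key], r[order_key]))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        prev = None  # lag(value, 1) is undefined for the first row of a partition
        for row in group:
            result = None if prev is None else row[value_key] - prev
            out.append({**row, "result": result})
            prev = row[value_key]
    return out


# Hypothetical dummy partition column "part" puts all rows in one partition.
rows = [
    {"part": "x", "row": 1, "a": 10, "b": 100},
    {"part": "x", "row": 2, "a": 20, "b": 200},
    {"part": "x", "row": 3, "a": 30, "b": 300},
]
results = [r["result"] for r in lag_diff(rows, "part", "row", "b")]
print(results)  # [None, 100, 100]
```

Because the previous value is reset at each partition boundary, nothing is computed across partitions, which is the trade-off mentioned above.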


Regards,
Harish.

-----Original Message-----
From: karanveer.singh@barclays.com [mailto:karanveer.singh@barclays.com]
Sent: Tuesday, April 10, 2012 7:52 AM
To: user@hive.apache.org
Subject: Re: Lag function in Hive

Thanks - I will check this out.

Meanwhile, would default clustering happen using rownum? How can I check how clustering
is happening in our environment?

Rgds

----- Original Message -----
From: David Kulp <dkulp@fiksu.com>
To: user@hive.apache.org <user@hive.apache.org>
Sent: Tue Apr 10 15:45:25 2012
Subject: Re: Lag function in Hive

New here.  Hello all.

Could you try a self-join, possibly also restricted to partitions?

E.g. SELECT t2.value - t1.value FROM mytable t1, mytable t2 WHERE t1.rownum = t2.rownum+1
AND t1.partition=foo AND t2.partition=bar

If your data is clustered by rownum, then this join should, in theory, be relatively fast
-- especially if it makes sense to exploit partitions.
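
As a rough illustration of what the self-join computes, here is a small Python sketch. It assumes `rownum` is unique and has no gaps; the function name `self_join_diff` and the sample table are made up. With the join condition t1.rownum = t2.rownum + 1, t1 is the later row, so the sketch subtracts later minus earlier to match the result Karan asked for.

```python
def self_join_diff(rows, key="rownum", value="value"):
    """Pair each row with the row whose key is one less; return (key, difference)."""
    by_key = {r[key]: r for r in rows}  # lookup side of the join (t2)
    out = []
    for t1 in rows:
        t2 = by_key.get(t1[key] - 1)  # join condition: t1.rownum = t2.rownum + 1
        if t2 is not None:  # inner join: rows without a predecessor are dropped
            out.append((t1[key], t1[value] - t2[value]))
    return out


table = [
    {"rownum": 1, "value": 100},
    {"rownum": 2, "value": 200},
    {"rownum": 3, "value": 300},
]
pairs = self_join_diff(table)
print(pairs)  # [(2, 100), (3, 100)]
```

Unlike a lag function, an inner self-join produces no row for the first record; an outer join would be needed to keep it with a NULL difference.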

-d

On Apr 10, 2012, at 10:37 AM, <karanveer.singh@barclays.com> wrote:

> Makes sense but is not the distribution across nodes for a chunk of records in that order.
>
> If Hive cannot help me do this, is there another way I can do it? I tried generating
> an identifier using a Perl script invoked from Hive, but it does not seem to work properly.
> While the standalone script works fine, when the records are created in Hive from the Perl
> script's standard output, I see 2 records for some of the unique identifiers. I explored the
> possibility of default data type changes, but that does not solve the problem.
>
> Regards,
> Karan
>
>
> -----Original Message-----
> From: Philip Tromans [mailto:philip.j.tromans@gmail.com]
> Sent: 10 April 2012 19:48
> To: user@hive.apache.org
> Subject: Re: Lag function in Hive
>
> Hi Karan,
>
> To the best of my knowledge, there isn't one. It's also unlikely to
> happen because it's hard to parallelise in a map-reduce way (it
> requires knowing where you are in a result set, and who your
> neighbours are and they in turn need to be present on the same node as
> you which is difficult to guarantee).
>
> Cheers,
>
> Phil.
>
> On 10 April 2012 14:44, <karanveer.singh@barclays.com> wrote:
>> Hi,
>>
>> Is there something like a 'lag' function in HIVE? The requirement is to
>> calculate difference for the same column for every 2 subsequent records.
>>
>> For example.
>>
>> Row, Column A, Column B
>> 1, 10, 100
>> 2, 20, 200
>> 3, 30, 300
>>
>>
>> The result that I need should be like:
>>
>> Row, Column A, Column B, Result
>> 1, 10, 100, NULL
>> 2, 20, 200, 100 (200-100)
>> 3, 30, 300, 100 (300-200)
>>
>> Rgds,
>> Karan
>>
>>
>>
>>
>>
>> This e-mail and any attachments are confidential and intended solely for the
>> addressee and may also be privileged or exempt from disclosure under
>> applicable law. If you are not the addressee, or have received this e-mail
>> in error, please notify the sender immediately, delete it from your system
>> and do not copy, disclose or otherwise act upon any part of this e-mail or
>> its attachments.
>>
>> Internet communications are not guaranteed to be secure or virus-free.
>> The Barclays Group does not accept responsibility for any loss arising from
>> unauthorised access to, or interference with, any Internet communications by
>> any third party, or from the transmission of any viruses. Replies to this
>> e-mail may be monitored by the Barclays Group for operational or business
>> reasons.
>>
>> Any opinion or other information in this e-mail or its attachments that does
>> not relate to the business of the Barclays Group is personal to the sender
>> and is not given or endorsed by the Barclays Group.
>>
>> Barclays Bank PLC.Registered in England and Wales (registered no. 1026167).
>> Registered Office: 1 Churchill Place, London, E14 5HP, United Kingdom.
>>
>> Barclays Bank PLC is authorised and regulated by the Financial Services
>> Authority.

