hive-user mailing list archives

From "Hamilton, Robert (Austin)" <robert.hamil...@hp.com>
Subject RE: Lag function in Hive
Date Tue, 10 Apr 2012 15:01:46 GMT
You can write a custom UDF.

Here is one that I have played around with, along with some test SQL. It comes with no warranty
:)

Sorry, I can't share the test data, but hopefully you get the idea. To run it, compile
the Lag class, jar it up into Analytics.jar, put the jar on the CLASSPATH (you may need to
deploy it to all the nodes on the cluster), and run the hive command below.

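Note that the "distribute by" and "sort by" clauses are critical: the UDF only remembers the
previous row it saw, so all rows for a session must reach the same reducer, in order. The
sub-select is just an artifice to make sure the UDF runs in the reducer (so that its input is
sorted). Maybe the Hive experts can suggest a better way to do that...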
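
To illustrate why both clauses matter, here is a minimal plain-Java sketch (no Hive involved) of what the query arranges. Rows are first assigned to a bucket by hashing the session id, roughly what "distribute by" is for, and then ordered by session id and timestamp inside each bucket, as "sort by" does. The class name, the {session id, timestamp} array layout, and the hash-modulo bucketing are assumptions made for illustration; this is not how Hive itself is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DistributeSortSketch {

    /** Mimics "distribute by session_id": rows with the same id share a bucket. */
    public static List<List<String[]>> distribute(List<String[]> rows, int reducers) {
        List<List<String[]>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String[] row : rows) {
            // row[0] is the session id, row[1] the timestamp (assumed layout).
            buckets.get(Math.floorMod(row[0].hashCode(), reducers)).add(row);
        }
        return buckets;
    }

    /** Mimics "sort by session_id, hit_datetime_gmt" inside each bucket. */
    public static void sortEachBucket(List<List<String[]>> buckets) {
        Comparator<String[]> bySessionThenTime =
            Comparator.comparing((String[] r) -> r[0]).thenComparing(r -> r[1]);
        for (List<String[]> bucket : buckets) {
            bucket.sort(bySessionThenTime);
        }
    }

    /** Runs both steps and renders each bucket as "id@time" entries, one bucket per line. */
    public static String run(List<String[]> rows, int reducers) {
        List<List<String[]>> buckets = distribute(rows, reducers);
        sortEachBucket(buckets);
        StringBuilder out = new StringBuilder();
        for (List<String[]> bucket : buckets) {
            for (String[] row : bucket) {
                out.append(row[0]).append('@').append(row[1]).append(' ');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Deliberately interleaved and out of order.
        List<String[]> rows = Arrays.asList(
            new String[] {"s1", "00:01:05"},
            new String[] {"s2", "00:00:59"},
            new String[] {"s1", "00:00:37"});
        System.out.print(run(rows, 2));
    }
}
```

After both steps, every session's rows sit next to each other in time order within a single bucket, which is exactly the precondition a "previous row" UDF relies on.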

#
# use live clickstream test data from 2012-01-12
#
hive -e "add jar Analytics.jar;

create temporary function lag as 'com.example.hive.udf.Lag';
select session_id, hit_datetime_gmt, lag(hit_datetime_gmt, session_id)
	from (select session_id, hit_datetime_gmt from omni2
	      where visit_day='2012-01-12' and session_id is not null
	distribute by session_id
	sort by session_id, hit_datetime_gmt ) X
distribute by session_id limit 1000
"

------------------------ Contents of Lag.java -----------------------------------------
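package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Returns the value of "key" from the previous row in the same group.
 * Stateful: rows must reach evaluate() grouped and sorted (see the query above).
 */
public final class Lag extends UDF {
    private String lastKey;    // value seen on the previous row
    private String lastGroup;  // group key seen on the previous row

    public String evaluate(String key, String groupKey) {
        // A null group key or a change of group means there is no previous value.
        // The comparison is case-insensitive; use equals() if group keys are case-sensitive.
        if (groupKey == null || !groupKey.equalsIgnoreCase(lastGroup)) {
            lastKey = null;
        }
        String previous = lastKey;
        lastKey = key;
        lastGroup = groupKey;
        return previous;
    }
}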
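
To try the logic without a Hive installation, the following stand-alone sketch copies the same state machine into a plain class and feeds it a few rows that are already grouped and sorted. LagSketch and lagColumn are hypothetical names used only for this illustration, and the sample rows are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LagSketch {
    private String lastKey;    // value seen on the previous row
    private String lastGroup;  // group key seen on the previous row

    /** Same logic as Lag.evaluate, minus the Hive UDF base class. */
    public String evaluate(String key, String groupKey) {
        if (groupKey == null || !groupKey.equalsIgnoreCase(lastGroup)) {
            lastKey = null;  // new group: nothing to lag from
        }
        String previous = lastKey;
        lastKey = key;
        lastGroup = groupKey;
        return previous;
    }

    /** Applies evaluate() to rows of {group, value} in the order given. */
    public static List<String> lagColumn(List<String[]> sortedRows) {
        LagSketch lag = new LagSketch();
        List<String> out = new ArrayList<>();
        for (String[] row : sortedRows) {
            out.add(lag.evaluate(row[1], row[0]));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"s1", "00:00:37"},
            new String[] {"s1", "00:00:59"},
            new String[] {"s2", "00:01:05"},
            new String[] {"s2", "00:01:07"});
        // Prints [null, 00:00:37, null, 00:01:05]
        System.out.println(lagColumn(rows));
    }
}
```

The first row of each group yields null, which corresponds to the NULL on the first line of the test run.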

Result of test run:

1326326437-26270601625187049522752846106448274394       2012-01-12 00:00:37     NULL
1326326437-26270601625187049522752846106448274394       2012-01-12 00:00:59     2012-01-12 00:00:37
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:05     2012-01-12 00:00:59
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:07     2012-01-12 00:01:05
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:11     2012-01-12 00:01:07
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:12     2012-01-12 00:01:11
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:24     2012-01-12 00:01:12
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:32     2012-01-12 00:01:24
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:45     2012-01-12 00:01:32
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:48     2012-01-12 00:01:45
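
The question in this thread asked for the difference between consecutive values rather than the previous value itself. A minimal sketch of that variant follows, written as a plain class so it runs without Hive. LagDiffSketch is a hypothetical name and is not part of the posted UDF; it assumes numeric values and the same grouped, sorted input as above.

```java
class LagDiffSketch {
    private Long lastValue;    // value seen on the previous row
    private String lastGroup;  // group key seen on the previous row

    /** Returns value minus the previous value in the same group, or null on a group's first row. */
    public Long evaluate(Long value, String groupKey) {
        if (groupKey == null || !groupKey.equals(lastGroup)) {
            lastValue = null;  // new group: nothing to subtract from
        }
        Long diff = (value == null || lastValue == null)
            ? null
            : Long.valueOf(value - lastValue);
        lastValue = value;
        lastGroup = groupKey;
        return diff;
    }

    public static void main(String[] args) {
        LagDiffSketch d = new LagDiffSketch();
        long[] columnB = {100, 200, 300};  // Column B from the example in the question
        for (long b : columnB) {
            System.out.println(b + " -> " + d.evaluate(b, "all"));
        }
    }
}
```

For the three rows in the question this prints null, 100 and 100 as the differences, matching the requested Result column.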

-----Original Message-----
From: Philip Tromans [mailto:philip.j.tromans@gmail.com] 
Sent: Tuesday, April 10, 2012 9:18 AM
To: user@hive.apache.org
Subject: Re: Lag function in Hive

Hi Karan,

To the best of my knowledge, there isn't one. It's also unlikely to happen, because it's hard
to parallelise in a map-reduce way: it requires knowing where you are in a result set and
who your neighbours are, and they in turn need to be present on the same node as you, which
is difficult to guarantee.

Cheers,

Phil.

On 10 April 2012 14:44,  <karanveer.singh@barclays.com> wrote:
> Hi,
>
> Is there something like a 'lag' function in HIVE? The requirement is 
> to calculate difference for the same column for every 2 subsequent records.
>
> For example.
>
> Row, Column A, Column B
> 1, 10, 100
> 2, 20, 200
> 3, 30, 300
>
>
> The result that I need should be like:
>
> Row, Column A, Column B, Result
> 1, 10, 100, NULL
> 2, 20, 200, 100 (200-100)
> 3, 30, 300, 100 (300-200)
>
> Rgds,
> Karan
