hive-user mailing list archives

From: Mark Grover
Subject: Re: Lag function in Hive
Date: Wed, 11 Apr 2012 13:31:05 GMT

Hi Karan,
The error you get when creating the temporary function typically happens when
there is a typo in the class name (com.example.hive.udf.Lag, in this case).

Can you check that the jar was built properly and contains the Lag class in the
com.example.hive.udf package?
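
One way to check the jar contents is sketched below in Java. The class name JarClassCheck and its containsClass helper are made up for illustration; the jar path and class name are the ones from this thread. Running "jar tf Analytics.jar" from a shell lists the same entries.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper, not part of Hive: checks whether a jar holds a given class.
final class JarClassCheck {

    /** Returns true if the jar has a .class entry for the fully qualified class name. */
    static boolean containsClass(String jarPath, String className) throws IOException {
        // com.example.hive.udf.Lag is stored as com/example/hive/udf/Lag.class
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults are the path and class name mentioned in this thread.
        String jarPath = args.length > 0 ? args[0] : "/users/unix/singhka/Analytics.jar";
        String className = args.length > 1 ? args[1] : "com.example.hive.udf.Lag";
        boolean found = containsClass(jarPath, className);
        System.out.println(className + (found ? " found in " : " NOT found in ") + jarPath);
    }
}
```

If the class is reported as not found, the usual suspects are a missing or different package declaration in the source file, or a jar built from the wrong directory level.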


Mark Grover, Business Intelligence Analyst
OANDA Corporation 

----- Original Message -----
From: "karanveer singh" <>
Sent: Wednesday, April 11, 2012 4:15:59 AM
Subject: RE: Lag function in Hive

Rob and all,

I tried the code below and created the jar file. To add the jar to the classpath, I do the following:

hive> add jar /users/unix/singhka/Analytics.jar;

That seems to have worked, as I see the resource added, but when I go ahead and create
the function, I get the following error. Any ideas what the issue could be?

hive> create temporary function lag as 'com.example.hive.udf.Lag';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask


-----Original Message-----
From: Hamilton, Robert (Austin) [] 
Sent: 10 April 2012 20:32
Subject: RE: Lag function in Hive

You can write a custom UDF.

Here is one that I have played around with, along with some test SQL. It comes with no warranty.

Sorry I can't really share the test data, but hopefully you get the idea.  To run, compile
the Lag class, jar it up into Analytics.jar, put the jar on the CLASSPATH (you may need to
deploy to all the nodes on the cluster) and run the hive command below.

Note the "distribute by" and "sort by"  are critical.  Also the sub-select is just an artifice
to make sure the UDF is running in the reducer (so that it is sorted).  Maybe the hive experts
can suggest a better way for that to work...
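
A minimal plain-Java sketch, with no Hive dependency, of why the ordering matters for a stateful UDF of this kind. LagSketch and lagAll are hypothetical names used only for illustration; the sketch assumes the function sees rows one at a time and remembers only the previous row.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: mimics a stateful lag function outside of Hive.
final class LagSketch {
    private String lastKey;    // value seen on the previous row
    private String lastGroup;  // group of the previous row

    /** Returns the previous row's key within the same group, or null at a group boundary. */
    String evaluate(String key, String groupKey) {
        if (groupKey == null || !groupKey.equals(lastGroup)) {
            lastKey = null;        // new group: nothing to lag from
            lastGroup = groupKey;
        }
        String previous = lastKey;
        lastKey = key;
        return previous;
    }

    /** Applies evaluate to rows of {group, key} in exactly the order given. */
    static List<String> lagAll(String[][] rows) {
        LagSketch lag = new LagSketch();
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            out.add(lag.evaluate(row[1], row[0]));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] sorted = {{"a", "t1"}, {"a", "t2"}, {"b", "t1"}, {"b", "t2"}};
        String[][] interleaved = {{"a", "t1"}, {"b", "t1"}, {"a", "t2"}, {"b", "t2"}};
        System.out.println(lagAll(sorted));       // [null, t1, null, t1]
        System.out.println(lagAll(interleaved));  // [null, null, null, null]
    }
}
```

With the rows grouped and sorted, each row sees its predecessor; with groups interleaved, every row looks like the start of a new group and the lag is always null.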

# use live clickstream test data from 2012-01-12
hive -e "add jar Analytics.jar;

create temporary function lag as 'com.example.hive.udf.Lag';
select session_id, hit_datetime_gmt, lag(hit_datetime_gmt, session_id)
	from (select session_id, hit_datetime_gmt from omni2
	where visit_day='2012-01-12' and session_id is not null
	distribute by session_id
	sort by session_id, hit_datetime_gmt) X
distribute by session_id limit 1000;"

------------------------ Contents of -----------------------------------------
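package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Stateful UDF: returns the value of "key" from the previous row of the same group.
// Relies on rows arriving sorted by group and then by key.
// NOTE: the body of evaluate() is cut off in the archived message. The lines marked
// "assumed" are a reconstruction consistent with the test output, not the original code.
public final class Lag extends UDF {
    private int counter;               // unused in the surviving code
    private String last_key;           // value from the previous row
    private String lastGroup;          // group of the previous row
    private String return_value = "";

    public String evaluate(String key, String groupKey) {
        if (!groupKey.equalsIgnoreCase(this.lastGroup)) {
            this.last_key = null;          // assumed: reset at a group boundary
            this.lastGroup = groupKey;     // assumed: remember the new group
        }
        return_value = this.last_key;      // assumed: NULL on the first row of a group
        this.last_key = key;
        return return_value;
    }
}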

Result of test run:

1326326437-26270601625187049522752846106448274394       2012-01-12 00:00:37     NULL
1326326437-26270601625187049522752846106448274394       2012-01-12 00:00:59     2012-01-12
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:05     2012-01-12
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:07     2012-01-12
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:11     2012-01-12
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:12     2012-01-12
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:24     2012-01-12
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:32     2012-01-12
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:45     2012-01-12
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:48     2012-01-12

-----Original Message-----
From: Philip Tromans [] 
Sent: Tuesday, April 10, 2012 9:18 AM
Subject: Re: Lag function in Hive

Hi Karan,

To the best of my knowledge, there isn't one. It's also unlikely to happen because it's hard
to parallelise in a map-reduce way (it requires knowing where you are in a result set, and
who your neighbours are and they in turn need to be present on the same node as you which
is difficult to guarantee).
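
Once the rows do sit together in order, the calculation itself is a single pass. A rough Java sketch using the Column B values from the original question; ConsecutiveDiff and diffs are hypothetical names, and the sketch assumes the input is already ordered by row number.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: difference between each value and the one before it.
final class ConsecutiveDiff {

    /** For values already in order, returns current minus previous; null for the first row. */
    static List<Integer> diffs(int[] columnB) {
        List<Integer> out = new ArrayList<>();
        Integer previous = null;
        for (int value : columnB) {
            if (previous == null) {
                out.add(null);              // no earlier row to compare with
            } else {
                out.add(value - previous);
            }
            previous = value;
        }
        return out;
    }

    public static void main(String[] args) {
        // Column B from the example in the question: 100, 200, 300
        System.out.println(diffs(new int[]{100, 200, 300}));  // [null, 100, 100]
    }
}
```

The hard part in map-reduce is the precondition, not the loop: every row needs its predecessor on the same node and in the right order.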



On 10 April 2012 14:44,  <> wrote:
> Hi,
> Is there something like a 'lag' function in HIVE? The requirement is 
> to calculate difference for the same column for every 2 subsequent records.
> For example.
> Row, Column A, Column B
> 1, 10, 100
> 2, 20, 200
> 3, 30, 300
> The result that I need should be like:
> Row, Column A, Column B, Result
> 1, 10, 100, NULL
> 2, 20, 200, 100 (200-100)
> 3, 30, 300, 100 (300-200)
> Rgds,
> Karan
