hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Justin Workman <justinjwork...@gmail.com>
Subject Help With Hive Windowing
Date Thu, 10 Dec 2015 17:21:20 GMT
I am attempting to run a simple query over a table with website click data
using last_value and lag functions to get the last two times we saw URL's
being hit by specific bots. The table I am querying has the following schema

Table Schema
URL
HIT_TIME
USER_AGENT
MONTH
DAY

Created as an unpartitioned ORC table with the last months worth of click
data. Based on the query below, I would expect to see one record for each
URL with the last HIT_TIME, and the previous HIT_TIME. Instead I am getting
several records for each URL, each record does appear to be returning the
last HIT_TIME and the previous HIT_TIME. How can I limit this to only
return one record per URL with the true last HIT_TIME and previous HIT_TIME?

Query
select
      url,
       last_value(hit_time) over w as last_seen_dt,
       lag(hit_time,1,0) over w as prev_seen_dt
from prod_user.web_hits
window w as (partition by url order by hit_time)

Any advice would be greatly appreciated. I am sure I just don't understand
the windowing syntax correct. I've tried wrapping this in a sub-select and
grouping by url and selecting distinct url. I've also played with the
ordering in the window specification.

Versions
Hive-0.11.XXX from CDH 5.3.3.

Thanks!
Justin

Mime
View raw message