hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Harish Butani (JIRA)" <>
Subject [jira] [Commented] (HIVE-896) Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive.
Date Wed, 09 Jan 2013 04:32:20 GMT


Harish Butani commented on HIVE-896:

Hi Alan,
Thanks for taking the time. Here are my responses:

1. Could you point out the interfaces...
Yes you are right, from a function writer perspective TableFunctionEvaluator, TableFunctionResolver
are the important ifcs; PTFPartition(and PTFPartitionIterator) is the data container ifc.

2. If I read this right you are using CLUSTER BY and SORT BY instead of PARTITION BY and ORDER
BY for syntax in OVER. Why?
To highlight the similarity. The Partition/Order specs in a Window clause have the same meaning
as Cluster/Distribute in HQL. Note you can  use a Cluster/Distribute at the query level and
not specify any Partition spec in a Window clause. So the following are different ways for
saying the same thing:

select p_mfgr, p_name, 
sum(p_retailprice) over (distribute by p_mfgr sort by p_name rows between unbounded preceding
and current row)
from part;
select p_mfgr, p_name, p_size,
sum(p_retailprice) over (rows between unbounded preceding and current row)
from part
distribute by p_mfgr
sort by p_name;
select p_mfgr, p_name, p_size,
sum(p_retailprice) over (w1)
from part
window w1 as distribute by p_mfgr  sort by p_name rows between 2 preceding and 2 following;

(I just realized that there are no egs of using Cluster/Distribute in Wdw clauses in the tests;
we are adding them now)

3. Can I put one of the existing aggregate functions in an OVER clause using this?
I am not exactly clear what your question is. I may have answered it above. To be clear there
is no special Window Function. Any existing Hive UDAF invocation can have a Windowing specification.

tests 31,40,41 cover most of the UDAFs.

4. Could you explain how the partition is handled in memory...
Partitions are backed by a Persistent List ( see ptf.ds.PartitionedByteBasedList) . We need
do to some work to refactor this package. Yes you are right, things can be done in delaying
bringing rows into a partition and getting rid of rows once outside the window. This is true
for Windowing Table Function; especially for Range based Windows.

But for a general PTF the contract is Partition in Partition out. For e.g. CandidateFrequency
function will read the rows in a partition multiple times.

The PartitionedByteBasedList is backed by a set of PersistentByteBasedLists which uses weak
refs and stores its data on disk. Done some testing with partitions with a million rows. But
I agree with what you are getting at: there is stuff that can be done to reduce the memory
footprint. Haven't gotten around to it....

> Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive.
> ---------------------------------------------------------------
>                 Key: HIVE-896
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: OLAP, UDF
>            Reporter: Amr Awadallah
>            Priority: Minor
>         Attachments: HIVE-896.1.patch.txt
> Windowing functions are very useful for click stream processing and similar time-series/sliding-window
> More details at:
> -- amr

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

View raw message