Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Fri, 18 Jan 2013 22:42:14 +0000 (UTC)
From: "Alan Gates (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Message-ID: <JIRA.12438744.1256158487291.161840.1358548934862@arcas>
In-Reply-To: <JIRA.12438744.1256158487291@arcas>
References: <JIRA.12438744.1256158487291@arcas>
Subject: [jira] [Commented] (HIVE-896) Add LEAD/LAG/FIRST/LAST analytical
 windowing functions to Hive.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HIVE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557700#comment-13557700 ] 

Alan Gates commented on HIVE-896:
---------------------------------

bq. If I read this right you are using CLUSTER BY and SORT BY instead of PARTITION BY and ORDER BY for syntax in OVER. Why?  To highlight the similarity. The Partition/Order specs in a Window clause have the same meaning as Cluster/Distribute in HQL. 
This is only true as long as you have only one OVER clause, right?  As soon as you add the ability to have separate OVER clauses partitioning by different keys (which users will want very soon) you lose this identity.

Even if you decide to retain this, I would argue that the standard PARTITION BY/ORDER BY syntax should be accepted as well.  HQL already has enough one-off syntax that makes life hard for people coming from more standard SQL.  We should not add to it.
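
To make the point about separate OVER clauses concrete, here is a minimal Python sketch (not Hive code). The rows, the column layout, and the helper `sum_over` are made up for illustration. It emulates a query with two window specs partitioned by different keys, which a single query-level clustering key could not describe for both:

```python
from collections import defaultdict

# Hypothetical click rows: (user, page, duration_seconds).
rows = [
    ("alice", "home", 10),
    ("bob", "home", 20),
    ("alice", "cart", 30),
    ("bob", "cart", 40),
]

def sum_over(rows, key_index, value_index):
    """Emulate SUM(value) OVER (PARTITION BY key): one total per input row."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return [totals[row[key_index]] for row in rows]

# Two window specs in the same query, each partitioned by a different key.
per_user = sum_over(rows, 0, 2)  # ... OVER (PARTITION BY user)
per_page = sum_over(rows, 1, 2)  # ... OVER (PARTITION BY page)

for row, user_total, page_total in zip(rows, per_user, per_page):
    print(row, user_total, page_total)
```

Each window needs its own grouping of the same input rows, so the partitioning belongs to the individual OVER clause rather than to the query as a whole.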

bq. Could you explain how the partition is handled in memory...
Partitions are backed by a Persistent List (see ptf.ds.PartitionedByteBasedList). We need to do some work to refactor this package. Yes, you are right: things can be done to delay bringing rows into a partition and to get rid of rows once they are outside the window. This is true for the Windowing Table Function, especially for Range based Windows.
But for a general PTF the contract is Partition in, Partition out. For example, the CandidateFrequency function will read the rows in a partition multiple times.
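
As a rough illustration of the eviction idea described above, here is a minimal Python sketch. It is not the Hive implementation, the name `sliding_sum` is hypothetical, and it uses a row-count window rather than a range-based one to keep the example short. Only the rows inside the current window are held in memory:

```python
from collections import deque

def sliding_sum(values, preceding):
    """Yield SUM over (ROWS BETWEEN `preceding` PRECEDING AND CURRENT ROW).

    Only the rows inside the current window are buffered; a row is
    evicted as soon as it falls out of the window.
    """
    window = deque()
    running = 0
    for value in values:
        window.append(value)
        running += value
        if len(window) > preceding + 1:
            running -= window.popleft()  # drop the row that left the window
        yield running

result = list(sliding_sum([1, 2, 3, 4, 5], preceding=2))
print(result)
```

A function that has to read its partition several times cannot be driven this way, which is why the general PTF contract needs the whole partition available.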

This is part of where I was going with my earlier question on why a windowing function would ever return a partition.  I am becoming less convinced that it makes sense to combine windowing and partition functions.  While they both take partitions as inputs, they return different things: partition functions return partitions, and windowing functions return a single value.  As you point out here, partition functions will also not be interested in the range-limiting features of windowing functions.  But I suspect that taking advantage of those features in windowing functions will be very important for performance optimizations.  At the very least, it seems like partition functions and windowing functions should be presented as separate entities to users and UDF writers, even if for now Hive shares some of the framework for handling them underneath.  That way, future optimizations and new features can be added in whatever way is advantageous for each.
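
One way to picture the separation, as a minimal Python sketch under stated assumptions: the class names `PartitionFunction`, `WindowFunction`, `Dedupe`, and `First`, and the driver `apply_window`, are all hypothetical and do not correspond to any existing Hive API. The two contracts differ in what they return:

```python
from abc import ABC, abstractmethod

class PartitionFunction(ABC):
    """Hypothetical contract: a whole partition in, a partition out."""
    @abstractmethod
    def process(self, partition): ...

class WindowFunction(ABC):
    """Hypothetical contract: one frame of rows in, a single value out."""
    @abstractmethod
    def evaluate(self, frame): ...

class Dedupe(PartitionFunction):
    """Returns a partition that may have fewer rows than its input."""
    def process(self, partition):
        seen, out = set(), []
        for row in partition:
            if row not in seen:
                seen.add(row)
                out.append(row)
        return out

class First(WindowFunction):
    """Returns the first row of the frame it is given."""
    def evaluate(self, frame):
        return frame[0]

def apply_window(fn, partition, preceding):
    """Call fn once per row, passing only the rows in that row's frame."""
    return [fn.evaluate(partition[max(0, i - preceding): i + 1])
            for i in range(len(partition))]

deduped = Dedupe().process([1, 1, 2, 1])
firsts = apply_window(First(), [5, 3, 8, 1], preceding=1)
print(deduped, firsts)
```

Because the driver for the window contract decides which rows each call sees, it is free to limit or stream the frame, while the partition contract makes no such promise.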
                
> Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive.
> ---------------------------------------------------------------
>
>                 Key: HIVE-896
>                 URL: https://issues.apache.org/jira/browse/HIVE-896
>             Project: Hive
>          Issue Type: New Feature
>          Components: OLAP, UDF
>            Reporter: Amr Awadallah
>            Priority: Minor
>         Attachments: HIVE-896.1.patch.txt
>
>
> Windowing functions are very useful for click stream processing and similar time-series/sliding-window analytics.
> More details at:
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1006709
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1007059
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1007032
> -- amr
