hive-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-896) Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive.
Date Fri, 18 Jan 2013 22:42:14 GMT

    [ https://issues.apache.org/jira/browse/HIVE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557700#comment-13557700 ]

Alan Gates commented on HIVE-896:
---------------------------------

bq. If I read this right you are using CLUSTER BY and SORT BY instead of PARTITION BY and
ORDER BY for syntax in OVER. Why?  To highlight the similarity. The Partition/Order specs
in a Window clause have the same meaning as Cluster/Distribute in HQL. 
This is only true as long as you have only one OVER clause, right?  As soon as you add the
ability to have separate OVER clauses partitioning by different keys (which users will want
very soon) you lose this identity.
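
To make the concern concrete, here is a minimal Python sketch (not Hive code; the click-stream
rows, the column names, and the `running_sum` helper are all hypothetical) of a query with two
OVER clauses that partition by different keys. Each clause needs its own partitioning and
ordering of the same rows, so a single query-level CLUSTER BY/SORT BY cannot describe both.

```python
from collections import defaultdict

# Hypothetical click-stream rows; the column names are assumptions for illustration.
rows = [
    {"user": "a", "page": "home", "ts": 1, "bytes": 10},
    {"user": "b", "page": "home", "ts": 2, "bytes": 20},
    {"user": "a", "page": "cart", "ts": 3, "bytes": 30},
    {"user": "b", "page": "cart", "ts": 4, "bytes": 40},
]


def running_sum(rows, partition_key, order_key, value_key):
    """Emulate SUM(value) OVER (PARTITION BY partition_key ORDER BY order_key).

    Returns one running total per input row, in input order.
    """
    # Group row indices by partition key value.
    partitions = defaultdict(list)
    for idx, row in enumerate(rows):
        partitions[row[partition_key]].append(idx)

    result = [None] * len(rows)
    for indices in partitions.values():
        total = 0
        # Walk each partition in its own sort order.
        for idx in sorted(indices, key=lambda i: rows[i][order_key]):
            total += rows[idx][value_key]
            result[idx] = total
    return result


# Standard SQL equivalent of the two calls below:
#   SELECT SUM(bytes) OVER (PARTITION BY user ORDER BY ts),
#          SUM(bytes) OVER (PARTITION BY page ORDER BY ts)
#   FROM clicks
by_user = running_sum(rows, "user", "ts", "bytes")
by_page = running_sum(rows, "page", "ts", "bytes")
```

The two result columns come from two different groupings of the same input, which is the
point at which the Partition/Order to Cluster/Sort identity no longer holds.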

Even if you decide to retain this I would argue that the standard PARTITION BY/ORDER BY syntax
should be accepted as well.  HQL already has enough one-off syntax that makes life hard for
people coming from more standard SQL.  We should not add to it.

bq. Could you explain how the partition is handled in memory...
Partitions are backed by a Persistent List (see ptf.ds.PartitionedByteBasedList). We need
to do some work to refactor this package. Yes, you are right: rows could be brought into a
partition lazily and dropped once they fall outside the window. This is true
for the Windowing Table Function, especially for Range based Windows.
But for a general PTF the contract is Partition in, Partition out. E.g., the CandidateFrequency
function will read the rows in a partition multiple times.
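
The row-eviction idea for bounded windows can be sketched as follows. This is an illustrative
Python sketch, not the ptf.ds implementation; `sliding_sum` and its `preceding` parameter are
hypothetical names, and it uses a row-count frame rather than a value range for simplicity.
It buffers at most `preceding + 1` rows instead of the whole partition.

```python
from collections import deque


def sliding_sum(values, preceding):
    """Yield SUM over ROWS BETWEEN `preceding` PRECEDING AND CURRENT ROW.

    Only rows inside the window are buffered; a row is dropped as soon
    as it falls outside the window.
    """
    window = deque()
    total = 0
    for value in values:
        window.append(value)
        total += value
        if len(window) > preceding + 1:
            total -= window.popleft()  # oldest row left the window: evict it
        yield total


sums = list(sliding_sum([1, 2, 3, 4, 5], preceding=2))
```

A function that must read the partition several times cannot use this trick, because evicted
rows are gone after the first pass.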

This is part of where I was going with my earlier question on why a windowing function would
ever return a partition.  I am becoming less convinced that it makes sense to combine windowing
and partition functions.  While they both take partitions as inputs they return different
things.  Partition functions return partitions and windowing functions return a single value.
As you point out here, the partition functions will also not be interested in the range limiting
features of windowing functions.  But taking advantage of this in windowing functions will
be very important for performance optimizations, I suspect.  At the very least it seems like
partition functions and windowing functions should be presented as separate entities to
users and UDF writers, even if for now Hive shares some of the framework for handling them
underneath.  This way in the future optimizations and new features can be added in a way that
is advantageous for each.
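
One way to picture the separation is as two distinct contracts for UDF writers. The Python
sketch below is purely illustrative: `PartitionFunction`, `WindowingFunction`, `TopN`, and
`Sum` are hypothetical names and do not correspond to any existing Hive interface.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


class PartitionFunction(ABC):
    """Hypothetical contract: a whole partition in, a partition out."""

    @abstractmethod
    def process(self, partition: List[Row]) -> List[Row]:
        ...


class WindowingFunction(ABC):
    """Hypothetical contract: the rows of one window in, a single value out."""

    @abstractmethod
    def evaluate(self, window: Iterable[Row]) -> Any:
        ...


class TopN(PartitionFunction):
    """Needs the full partition, and returns a (smaller) partition."""

    def __init__(self, n: int, key: str) -> None:
        self.n = n
        self.key = key

    def process(self, partition: List[Row]) -> List[Row]:
        ordered = sorted(partition, key=lambda r: r[self.key], reverse=True)
        return ordered[: self.n]


class Sum(WindowingFunction):
    """Sees only the current window, so the framework may bound its input."""

    def __init__(self, column: str) -> None:
        self.column = column

    def evaluate(self, window: Iterable[Row]) -> Any:
        return sum(row[self.column] for row in window)
```

With the contracts split this way, a framework could stream bounded windows to a
`WindowingFunction` while still materializing full partitions for a `PartitionFunction`.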
                
> Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive.
> ---------------------------------------------------------------
>
>                 Key: HIVE-896
>                 URL: https://issues.apache.org/jira/browse/HIVE-896
>             Project: Hive
>          Issue Type: New Feature
>          Components: OLAP, UDF
>            Reporter: Amr Awadallah
>            Priority: Minor
>         Attachments: HIVE-896.1.patch.txt
>
>
> Windowing functions are very useful for click stream processing and similar time-series/sliding-window analytics.
> More details at:
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1006709
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1007059
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1007032
> -- amr

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
