flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-5653) Add processing time OVER ROWS BETWEEN x PRECEDING aggregation to SQL
Date Mon, 27 Mar 2017 08:35:41 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15942840#comment-15942840 ]

ASF GitHub Bot commented on FLINK-5653:
---------------------------------------

Github user fhueske commented on the issue:

    https://github.com/apache/flink/pull/3574
  
    Hi @huawei-flink, thanks for your detailed explanation. 
    
    The benefit of the MapState is that we only need to deserialize the keys, not all rows as in the ValueState or ListState case. Identifying the smallest key (as needed for OVER ROWS) is basically free. Once the smallest key has been found, we only need to deserialize the rows that need to be retracted; all other rows are not touched at all.
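The idea above can be sketched in plain Java (a TreeMap standing in for Flink's keyed MapState, a byte array standing in for a serialized row; the class name and structure are illustrative, not Flink's actual API): finding the smallest key touches only keys, and only the row that leaves the window is ever deserialized.

```java
import java.util.TreeMap;

// Illustrative sketch, NOT Flink's MapState API: per-key OVER ROWS state
// modeled as a sorted map from row index to serialized row.
public class MapStateSketch {
    private final TreeMap<Long, byte[]> state = new TreeMap<>();
    private final int windowSize;
    private long nextIndex = 0;

    public MapStateSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    // Insert a new row; if the window overflows, retract exactly one row.
    // Only the evicted row is "deserialized" (here: turned back into a String).
    public String add(byte[] serializedRow) {
        state.put(nextIndex++, serializedRow);
        if (state.size() > windowSize) {
            long smallestKey = state.firstKey();   // cheap: inspects keys only
            byte[] evicted = state.remove(smallestKey);
            return new String(evicted);            // deserialize only this row
        }
        return null;                               // nothing retracted yet
    }
}
```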
    
    The benchmarks that @rtudoran ran were done with an in-memory state backend, which does not de/serialize data but keeps the state as objects on the heap. I think the numbers would be different if you switched to the RocksDB state backend, which serializes all data (RocksDB is the only state backend recommended for production settings). In fact, I would read from the benchmark results that sorting the keys does not have a major impact on performance. Another important aspect of the design is that RocksDB iterates over the map keys in order, so even sorting (or rather ensuring a sorted order) becomes O(n).
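The ordered-iteration point can be illustrated in plain Java (a TreeMap standing in for RocksDB's sorted key space; the class and method names are hypothetical): walking entries in ascending key order lets a retraction pass stop as soon as it reaches the window boundary, with no explicit sort step.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustration only: TreeMap entries iterate in ascending key order,
// like RocksDB's key space, so a scan up to a boundary is an early-exit
// linear pass rather than a sort.
public class OrderedScan {
    public static long sumKeysBelow(TreeMap<Long, String> state, long boundary) {
        long sum = 0;
        for (Map.Entry<Long, String> e : state.entrySet()) { // ascending keys
            if (e.getKey() >= boundary) {
                break;                                       // early exit
            }
            sum += e.getKey();
        }
        return sum;
    }
}
```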
    
    I do see the benefits of keeping data in order, but de/serialization is one of the major costs when processing data on the JVM, so it makes a lot of sense to optimize for reduced de/serialization overhead.


> Add processing time OVER ROWS BETWEEN x PRECEDING aggregation to SQL
> --------------------------------------------------------------------
>
>                 Key: FLINK-5653
>                 URL: https://issues.apache.org/jira/browse/FLINK-5653
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Table API & SQL
>            Reporter: Fabian Hueske
>            Assignee: Stefano Bortoli
>
> The goal of this issue is to add support for OVER ROWS aggregations on processing time streams to the SQL interface.
> Queries similar to the following should be supported:
> {code}
> SELECT
>   a,
>   SUM(b) OVER (PARTITION BY c ORDER BY procTime() ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sumB,
>   MIN(b) OVER (PARTITION BY c ORDER BY procTime() ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS minB
> FROM myStream
> {code}
> The following restrictions should initially apply:
> - All OVER clauses in the same SELECT clause must be exactly the same.
> - The PARTITION BY clause is optional (no partitioning results in single-threaded execution).
> - The ORDER BY clause may only have procTime() as parameter. procTime() is a parameterless scalar function that just indicates processing time mode.
> - UNBOUNDED PRECEDING is not supported (see FLINK-5656).
> - FOLLOWING is not supported.
> The restrictions will be resolved in follow-up issues. If we find that some of the restrictions are trivial to address, we can add the functionality in this issue as well.
> This issue includes:
> - Design of the DataStream operator to compute OVER ROW aggregates
> - Translation from Calcite's RelNode representation (LogicalProject with RexOver expression).
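
The windowed aggregation described in the issue can be sketched in plain Java (outside any Flink operator; the class name is hypothetical and the sketch covers only SUM): keep at most x + 1 rows per key and retract the value of the row that leaves the window, as a ROWS BETWEEN x PRECEDING AND CURRENT ROW clause requires.

```java
import java.util.ArrayDeque;

// Hedged sketch of ROWS BETWEEN x PRECEDING AND CURRENT ROW semantics
// for a running SUM, assuming a single partition key. Not the actual
// Flink DataStream operator designed in this issue.
public class RowsPrecedingSum {
    private final ArrayDeque<Long> window = new ArrayDeque<>();
    private final int windowSize;   // x PRECEDING + CURRENT ROW = x + 1 rows
    private long sum = 0;

    public RowsPrecedingSum(int precedingRows) {
        this.windowSize = precedingRows + 1;
    }

    // Add the current row's value and return the aggregate over the window.
    public long add(long value) {
        window.addLast(value);
        sum += value;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();  // retract the row that left the window
        }
        return sum;
    }
}
```

With 2 PRECEDING, each emitted sum covers at most the three most recent rows of the partition.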



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
