samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <>
Subject Re: What next for streaming SQL?
Date Thu, 30 Apr 2015 00:29:36 GMT
Can you give an example of the SQL syntax you are using for tumbling windows? Does it use GROUP
BY and FLOOR, as in

What do you mean by “tumbling window size”? You can easily deduce that "FLOOR(rowtime
TO HOUR)” covers an hour range of the rowtime column, but to compute the number of rows
or bytes you’d have to make assumptions about data rates and then you’d only get an estimate.

Regarding re-issuing totals to incorporate late arrivals. It sounds useful, but you’ll have
to be careful that it doesn’t screw up other operators downstream. Imagine that you have
an aggregate followed by another aggregate that rolls it up. If the downstream operator isn’t
expecting duplicates then it may double-count.

I think it may be OK if the stream defines a primary key, specifies that there may be duplicates
and the duplicates will be compacted. But in short, we need more metadata, because the consumer
is a dumb operator not a smart human.

Do you have a URL for Yi’s design doc?

By the way, I am just about to check in a patch for
<> “FILTER clause for aggregate functions”.
I think it would be really useful for streaming queries, because you can’t afford to re-run
the query for a subset of the data. Samza-sql should get this virtually for free when it gets
the next Calcite release.

> On Apr 28, 2015, at 7:40 AM, Milinda Pathirage <> wrote:
> Hi Julian,
> I am working on tumbling windows and hoping to have a look at other types of window aggregates
next. I was trying to extract the window spec out from the aggregate operator (for tumbling
window) and figure out that its impossible to infer tumbling window size from date time expressions
or from an expression over any other type of monotonic field (such as row number for tuple
based windows). So we were thinking of implementing aggregates like we normally implement
stream aggregate in standard SQL (assuming group by fields are sorted) but with support for
handling out of order arrivals. One difference in this method compared to stream aggregate
from SQL is that an input row(s) can contribute to multiple outputs due to late arrivals.
My plan is to emit the first result for tumbling window aggregate when we see a new tuple
from the next window and emit result again if we get a tuple for an old window. We'll have
a window closing policy where we will not handle tuples arriving after the window timeout.
Yi's window operator design document contains most of the details required. What do you think
about this approach to implement tumbling windows? We highly appreciate your feedback on this.
> Thanks
> Milinda
> On Mon, Apr 27, 2015 at 6:15 PM, Julian Hyde < <>>
> Milinda,
> I have seen your work adding initial streaming SQL to Samza. Good stuff.
> Which types of query are you thinking of doing next?
> As of calcite-1.2, the streaming extensions are in Calcite’s master branch. (See
<>.) We are a couple
of weeks away from the next Calcite release. If you need some work done in Calcite, now would
be a good time.
> Julian
> -- 
> Milinda Pathirage
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: <>

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message