samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Milinda Pathirage <mpath...@umail.iu.edu>
Subject Re: Triggering emits for streaming window aggregates
Date Fri, 26 Jun 2015 18:40:44 GMT
Hi Yi,

In this specific case ordering is declared in the schema. Quoting from
Calcite documentation

~~~~~~~~~~~~~~~~~~~~
Monotonic columns need to be declared in the schema. The monotonicity is
enforced when records enter the stream and assumed by queries that read
from that stream. We recommend that you give each stream a timestamp column
called rowtime, but you can declare others, orderId, for example.
~~~~~~~~~~~~~~~~~~~~



If we can propagate this ordering information to LogicalAggregate then we
can easily handle this. As I understand required information is accessible
to Calcite query planner. But in our case we need this information after we
get the query plan from Calcite. AFAIK, current API doesn't provide a way
to get this information in scenarios like above where ORDER BY is not
specified in the query (I am not 100% sure about ORDER BY case too. I need
to have a look at a query plan generated for a query with ORDER BY).

Thanks
Milinda

On Fri, Jun 26, 2015 at 2:30 PM, Yi Pan <nickpan47@gmail.com> wrote:

> Hi, Milinda,
>
> I thought that in your example, the ordering field is given in GROUP BY.
> Are we missing a way to pass the ordering field(s) to the LogicalAggregate?
>
> -Yi
>
> On Fri, Jun 26, 2015 at 10:49 AM, Milinda Pathirage <mpathira@umail.iu.edu
> >
> wrote:
>
> > Hi Julian,
> >
> > Even though this is a general question across all the streaming
> aggregates
> > which utilize GROUP BY clause and a monotonic timestamp field for
> > specifying the window, but I am going to stick to most basic example
> (which
> > is from Calcite Streaming document).
> >
> > SELECT STREAM FLOOR(rowtime TO HOUR) AS rowtime,
> >   productId,
> >   COUNT(*) AS c,
> >   SUM(units) AS units
> > FROM Orders
> > GROUP BY FLOOR(rowtime TO HOUR), productId;
> >
> > I was trying to implement an aggregate operator which handles tumbling
> > windows via the monotonic field in GROUP By clause in addition to the
> > general aggregations. I went in this path because I thought integrating
> > windowing aspects (at least for tumbling and hopping) into aggregate
> > operator will be easier than trying to extract the window spec from the
> > query plan for a query like above. But I hit a wall when trying to figure
> > out trigger condition for emitting aggregate results. I was initially
> > planning to detect new values for FLOOR(rowtime TO HOUR) and emit current
> > aggregate results for previous groups (I was thinking to keep old groups
> > around until we clean them up after a timeout). But when trying to
> > implement this I figured out that I don’t know how to check which GROUP
> BY
> > field is monotonic so that I only detect new values for the monotonic
> > field/fields, not for the all the other fields. I think this is not a
> > problem for tables because we have the whole input before computation and
> > we wait till we are done with the input before emitting the results.
> >
> > With regards to above can you please clarify following things:
> >
> > - Is the method I described above for handling streaming aggregates make
> > sense at all?
> > - Is there a way that I can figure out which fields/expressions in
> > LogicalAggregate are monotonic?
> > - Or can we write a rule to annotate or add extra metadata to
> > LogicalAggregate so that we can get monotonic fields in the GROUP By
> clause
> >
> > Thanks in advance
> > Milinda
> >
> >
> > --
> > Milinda Pathirage
> >
> > PhD Student | Research Assistant
> > School of Informatics and Computing | Data to Insight Center
> > Indiana University
> >
> > twitter: milindalakmal
> > skype: milinda.pathirage
> > blog: http://milinda.pathirage.org
> >
>



-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message