flink-dev mailing list archives

From Radu Tudoran <radu.tudo...@huawei.com>
Subject RE: Stream SQL and Dynamic tables
Date Mon, 30 Jan 2017 17:14:37 GMT
Hi Fabian,

Thanks for the clarifications. I have a follow-up question: you say that operations are expected
to be bounded in space and time (e.g., the optimizer will do a cleanup after a certain timeout
period). Can I assume that this implies that, at the level of the system, there will be a couple
of parameters that hold these thresholds and that can be configured?

For example, as a setting on the execution environment:

env.setCleanupTimeout(100, TimeUnit.MINUTES);

...or alternatively perhaps directly at the level of the table (either the table environment or
the table itself):

TableEnvironment tbEnv = ...
tbEnv.setCleanupTimeout(100, TimeUnit.MINUTES);
Table tb = ...
tb.setCleanupTimeout(100, TimeUnit.MINUTES);



-----Original Message-----
From: Fabian Hueske [mailto:fhueske@gmail.com] 
Sent: Friday, January 27, 2017 9:41 PM
To: dev@flink.apache.org
Subject: Re: Stream SQL and Dynamic tables

Hi Radu,

the idea is to only support operations that are bounded in space and compute time:

- space: the size of the state may not grow infinitely over time or with a growing key domain.
For these cases, the optimizer will enforce a cleanup timeout, and all data that has passed that
timeout will be discarded (a rough sketch of such a cleanup timer follows after these two points).
Operations which cannot be bounded in space will be rejected.

- compute time: certain queries cannot be executed efficiently because newly arriving data
(late data or just newly appended rows) might trigger recomputation of large parts of the
current state. Operations that would result in such a computation pattern will be rejected.
One example would be event-time OVER ROWS windows, as we discussed in the other thread.
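
To make the space bound concrete, here is a minimal DataStream-level sketch of such a per-key
cleanup timer. It is only meant to illustrate the general pattern the optimizer could generate
internally, not the actual implementation, and the 100 minute timeout is an arbitrary placeholder:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Running sum per key whose state is discarded after 100 minutes without updates.
// Meant to be applied on a keyed stream, e.g. stream.keyBy(0).process(new SumWithCleanup()).
public class SumWithCleanup extends ProcessFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private static final long TIMEOUT_MS = 100L * 60 * 1000; // 100 minutes

    private transient ValueState<Long> sum;
    private transient ValueState<Long> lastUpdate;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(new ValueStateDescriptor<>("sum", Long.class));
        lastUpdate = getRuntimeContext().getState(new ValueStateDescriptor<>("lastUpdate", Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> value, Context ctx,
                               Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = sum.value();
        long updated = (current == null ? 0L : current) + value.f1;
        sum.update(updated);
        out.collect(Tuple2.of(value.f0, updated));

        // remember when this key was last seen and (re-)arm a cleanup timer
        long now = ctx.timerService().currentProcessingTime();
        lastUpdate.update(now);
        ctx.timerService().registerProcessingTimeTimer(now + TIMEOUT_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, Long>> out) throws Exception {
        // only discard the state if the key has really been idle for the full timeout
        Long last = lastUpdate.value();
        if (last != null && timestamp >= last + TIMEOUT_MS) {
            sum.clear();
            lastUpdate.clear();
        }
    }
}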

So the plan is that the optimizer takes care of limiting the space requirements and computation
effort.
However, you are of course right: retraction and long-running windows can result in significant
amounts of operator state.
I don't think this is a special requirement for the Table API (there are users of the DataStream
API with jobs that manage TBs of state). Persisting state to disk with RocksDB and scaling
out to more nodes should address the scaling problem initially. In the long run, the Flink
community will work to improve the handling of large state with features such as incremental
checkpoints and new state backends.
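
For completeness, a minimal sketch of switching a job to the RocksDB state backend (assuming the
flink-statebackend-rocksdb dependency is on the classpath; the checkpoint URI is just a
placeholder):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // keep working state in RocksDB on local disk and checkpoint it to a shared file system
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
        // ... job definition and env.execute() would follow here
    }
}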

Looking forward to your comments.

Best,
Fabian

2017-01-27 11:01 GMT+01:00 Radu Tudoran <radu.tudoran@huawei.com>:

> Hi,
>
> Thanks for the clarification Fabian - it is really useful.
> I agree that we should consolidate the module and avoid the need to
> further maintain 3 different "projects". It does make sense to see the
> current (I would call it) "Stream SQL" as a table with append semantics.
> However, one thing that should be clarified is what the best way is, from
> an implementation point of view, to keep the state of the table (if we can
> actually keep it - though the need is clear for supporting retraction). As
> the input is a stream and the table is append-only, we of course run into
> the classical unbounded-growth issue that streams have. What should the approach be?
> Should we consider keeping the data in something like the state backend
> used now for windows, and then pushing it to disk (e.g., like the
> FsStateBackend)? Perhaps with the disk we can at least enlarge the
> horizon of what we keep.
> I will add some comments and thoughts on this in the document.
>
>
> Dr. Radu Tudoran
> Senior Research Engineer - Big Data Expert IT R&D Division
>
>
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> European Research Center
> Riesstrasse 25, 80992 München
>
> E-mail: radu.tudoran@huawei.com
> Mobile: +49 15209084330
> Telephone: +49 891588344173
>
>
> -----Original Message-----
> From: Fabian Hueske [mailto:fhueske@gmail.com]
> Sent: Thursday, January 26, 2017 3:37 PM
> To: dev@flink.apache.org
> Subject: Re: Stream SQL and Dynamic tables
>
> Hi Radu,
>
> the idea is to have dynamic tables as the common ground for Table API and
> SQL.
> I don't think it is a good idea to implement and maintain 3 different
> relational APIs with possibly varying semantics.
>
> Actually, you can see the current status of the Table API / SQL on streams
> as a subset of the proposed semantics.
> Right now, all streams are implicitly converted into Tables with APPEND
> semantics. The currently supported operations (selection, filter, union,
> group windows) return streams.
> The only thing that would change for these operations is that the output
> mode would default to retraction mode, in order to be able to emit updated
> records (e.g., updated aggregates due to late records).
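>
> For illustration, a minimal sketch of that implicit append-mode conversion with the
> current Java Table API (method names may differ slightly across releases):
>
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.streaming.api.datastream.DataStream;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> import org.apache.flink.table.api.Table;
> import org.apache.flink.table.api.TableEnvironment;
> import org.apache.flink.table.api.java.StreamTableEnvironment;
> import org.apache.flink.types.Row;
>
> public class AppendModeExample {
>     public static void main(String[] args) throws Exception {
>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>         StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
>
>         DataStream<Tuple2<String, Long>> clicks = env.fromElements(
>                 Tuple2.of("alice", 1L), Tuple2.of("bob", 3L));
>
>         // the stream is registered as a table with (implicit) append semantics
>         tEnv.registerDataStream("Clicks", clicks, "name, cnt");
>
>         // selection / filter: each input row yields at most one appended output row
>         Table result = tEnv.sql("SELECT name, cnt FROM Clicks WHERE cnt > 1");
>
>         tEnv.toDataStream(result, Row.class).print();
>         env.execute();
>     }
> }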
>
> The document is not final and we can of course discuss the proposal.
>
> Best, Fabian
>
> 2017-01-26 11:33 GMT+01:00 Radu Tudoran <radu.tudoran@huawei.com>:
>
> > Hi all,
> >
> >
> >
> > I have a question with respect to the scope of the initiative behind
> > relational queries on data streams:
> >
> > https://docs.google.com/document/d/1qVVt_16kdaZQ8RTfA_f4konQPW4tnl8THw6rzGUdaqU/edit#
> >
> >
> >
> > Is the approach of using dynamic tables intended to replace the
> > implementation and mechanisms built now in Stream SQL? Or will the
> > two co-exist, with one built on top of the other?
> >
> >
> >
> > Also – is the document in its final form, or can we still provide
> > feedback / ask questions?
> >
> >
> >
> > Thanks for the clarification (and sorry if I missed at some point the
> > discussion that might have clarified this)
> >
> >
> >
> > Dr. Radu Tudoran
> >
> > Senior Research Engineer - Big Data Expert
> >
> > IT R&D Division
> >
> >
> >
> > HUAWEI TECHNOLOGIES Duesseldorf GmbH
> >
> > European Research Center
> >
> > Riesstrasse 25, 80992 München
> >
> >
> >
> > E-mail: radu.tudoran@huawei.com
> >
> > Mobile: +49 15209084330
> >
> > Telephone: +49 891588344173
> >
>