flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephan Ewen <se...@apache.org>
Subject Re: Design documents for consolidated DataStream API
Date Tue, 14 Jul 2015 08:55:09 GMT
keyBy() does not do any grouping. Grouping in streams in not defined
without windows.

On Tue, Jul 14, 2015 at 10:48 AM, Gyula Fóra <gyula.fora@gmail.com> wrote:

> If we only want to have either keyBy or groupBy, why not keep groupBy? That
> would be more consistent with the batch api.
> On Tue, Jul 14, 2015 at 10:35 AM Stephan Ewen <sewen@apache.org> wrote:
>
> > Concerning your comments:
> >
> > 1) In the new design, there is no grouping without windowing. The
> > KeyedDataStream subsumes the grouping and key-ing for partitioned state.
> >
> >     The keyBy() + window() makes a parallel grouped window
> >     keyBy() alone allows access to partitioned state.
> >
> >     My thought was that this is simpler, because it needs not groupBy()
> and
> > keyBy(), but one construct to handle both cases.
> >
> > 2) The discretization is a rough thought and is nothing for the short
> term.
> > It totally needs more thoughts. I put it there to have it as a sketch for
> > how to evolve this.
> >
> >     The idea is of course to not have a single data set, but a series of
> > data set. In each discrete time slice, the data set can be treated like a
> > regular data set.
> >
> >     Let's kick off a separate design for the discretization. Joins are
> good
> > to talk about (data sets can be joined with data set), and I am sure
> there
> > are more questions coming up.
> >
> >
> > Does that make sense?
> >
> >
> >
> >
> >
> > On Tue, Jul 14, 2015 at 10:05 AM, Gyula Fóra <gyula.fora@gmail.com>
> wrote:
> >
> > > I think Marton has some good points here.
> > >
> > > 1) Is KeyedDataStream a better name if this is only a renaming?
> > >
> > > 2) the discretize semantics is unclear indeed. Are we operating on a
> > single
> > > or sequence of datasets? If the latter why not call it something else
> > > (dstream). How are joins and other binary operators defined for
> different
> > > discretizations etc.
> > > On Mon, Jul 13, 2015 at 7:37 PM Márton Balassi <mbalassi@apache.org>
> > > wrote:
> > >
> > > > Generally I agree with the new design. Two concerns:
> > > >
> > > > 1) Does KeyedDataStream replace GroupedDataStream or is it the
> latter a
> > > > special case of the former?
> > > >
> > > > The KeyedDataStream as described in the design document is a bit
> > unclear
> > > > for me. It lists the following usages:
> > > >   a) It is the first step in building a window stream, on top of
> which
> > > the
> > > > grouped/windowed aggregation and reduce-style function can be applied
> > > >   b) It allows to use the "by-key" state of functions. Here, every
> > record
> > > > has access to a state that is scoped by its key. Key-scoped state can
> > be
> > > > automatically redistributed and repartitioned.
> > > >
> > > > The code snippet describes a use case where the computation and the
> > > access
> > > > of the state is used the way currently the GroupedDataStream should
> > > work. I
> > > > suppose this is the example for case b). Would case a) also window
> > > elements
> > > > by key? If yes, then this is practically a renaming and enhancement
> of
> > > the
> > > > GroupedDataStream functionality with keyed state. Then the
> > > > StreamExecutionEnvironment.createKeyedStream(Partitioner,
> > > > KeySelector)construction does not make much sense as the user only
> > > operates
> > > > within the scope of the keyselector and not the partitioner anyway.
> > > >
> > > > I personally think KeyedDataStream as a name does not necessarily
> > suggest
> > > > that the records are grouped by key, it only suggests partitioning by
> > > key -
> > > > at least for me. :)
> > > >
> > > > 2) The API for discretization is not convenient IMHO
> > > >
> > > > The discretization part declares that the output of
> > > DataStream.discretize()
> > > > is a sequence of DataSets. I love this approach, but then in the code
> > > > snippet the return value of this function is simply a DataSet and
> uses
> > it
> > > > as such. The take home message of that code is the following: this is
> > > > actually the way you would like to program on these sequence of
> > DataSets,
> > > > most probably you would like to do the same with each of them. If
> that
> > is
> > > > the case we should provide a nice utility for that. I think Spark
> > > > Streaming's DStream.foreachRDD() is fairly useful for this purpose.
> > > >
> > > > On Mon, Jul 13, 2015 at 6:30 PM, Gyula Fóra <gyula.fora@gmail.com>
> > > wrote:
> > > >
> > > > > +1
> > > > > On Mon, Jul 13, 2015 at 6:23 PM Stephan Ewen <sewen@apache.org>
> > wrote:
> > > > >
> > > > > > If naming is the only concern, then we should go ahead, because
> we
> > > can
> > > > > > change names easily (before the release).
> > > > > >
> > > > > > In fact, I don't think it leaves a bad impression. Global windows
> > are
> > > > > > non-parallel windows. There are also parallel windows. Pick
what
> > you
> > > > need
> > > > > > and what works.
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 13, 2015 at 6:13 PM, Gyula Fóra <
> gyula.fora@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > I think we agree on everything its more of a naming issue
:)
> > > > > > >
> > > > > > > I thought it might be misleading that global time windows
are
> > > > > > > "non-parallel" windows. We dont want to give a bad impression.
> > > (Also
> > > > we
> > > > > > > dont want them to think that every global window is parallel
> but
> > > > thats
> > > > > > not
> > > > > > > a problem here)
> > > > > > >
> > > > > > > Gyula
> > > > > > > On Mon, Jul 13, 2015 at 5:22 PM Stephan Ewen <sewen@apache.org
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Okay, what is missing about the windowing in your
opinion?
> > > > > > > >
> > > > > > > > The core points of the document are:
> > > > > > > >
> > > > > > > >   - The parallel windows are per group only.
> > > > > > > >
> > > > > > > >   - The implementation of the parallel windows holds
window
> > data
> > > in
> > > > > the
> > > > > > > > group buffers.
> > > > > > > >
> > > > > > > >   - The global windows are non-parallel. May have
parallel
> > > > > > > pre-aggregation,
> > > > > > > > if they are time windows.
> > > > > > > >
> > > > > > > >   - Time may be operator time (timer thread), or watermark
> > time.
> > > > > > > Watermark
> > > > > > > > time can refer to ingress or event time.
> > > > > > > >
> > > > > > > >   - Windows that do not pre-aggregate may require
elements in
> > > > order.
> > > > > > Not
> > > > > > > > part of the first prototype.
> > > > > > > >
> > > > > > > > Do we agree on those points?
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Jul 13, 2015 at 4:50 PM, Gyula Fóra <
> > > gyula.fora@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > In general I like it, although the main difference
between
> > the
> > > > > > current
> > > > > > > > and
> > > > > > > > > the new one is the windowing and that is still
not very
> > clear.
> > > > > > > > >
> > > > > > > > > Where do we have the full stream time windows
for
> > > instance?(which
> > > > > is
> > > > > > > > > parallel but not keyed)
> > > > > > > > > On Mon, Jul 13, 2015 at 4:28 PM Aljoscha Krettek
<
> > > > > > aljoscha@apache.org>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1 I like it as well.
> > > > > > > > > >
> > > > > > > > > > On Mon, 13 Jul 2015 at 16:17 Kostas Tzoumas
<
> > > > ktzoumas@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > +1 from my side
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jul 13, 2015 at 4:15 PM, Stephan
Ewen <
> > > > > sewen@apache.org>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Do we have consensus on these
designs?
> > > > > > > > > > > >
> > > > > > > > > > > > If we have, we should get to implementing
this soon,
> > > > because
> > > > > > > > > basically
> > > > > > > > > > > all
> > > > > > > > > > > > streaming patches will have to
be revisited in light
> of
> > > > > this...
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Jul 7, 2015 at 3:41 PM,
Gyula Fóra <
> > > > > > gyula.fora@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > You are right thats an important
issue.
> > > > > > > > > > > > >
> > > > > > > > > > > > > And I think we should also
do some renaming with
> the
> > > > > > > "iterations"
> > > > > > > > > > > because
> > > > > > > > > > > > > they are not really iterations
like in the batch
> case
> > > and
> > > > > it
> > > > > > > > might
> > > > > > > > > > > > confuse
> > > > > > > > > > > > > some users.
> > > > > > > > > > > > > Maybe we can call them loops
or cycles and rename
> the
> > > api
> > > > > > calls
> > > > > > > > to
> > > > > > > > > > make
> > > > > > > > > > > > it
> > > > > > > > > > > > > more intuitive what happens.
It is really just a
> > cyclic
> > > > > > > dataflow.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Aljoscha Krettek <aljoscha@apache.org>
ezt írta
> > > > (időpont:
> > > > > > > 2015.
> > > > > > > > > júl.
> > > > > > > > > > > 7.,
> > > > > > > > > > > > > K,
> > > > > > > > > > > > > 15:35):
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > I just noticed that
we don't have anything about
> > how
> > > > > > > iterations
> > > > > > > > > and
> > > > > > > > > > > > > > timestamps/watermarks
should interact.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > Aljoscha
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, 6 Jul 2015 at
23:56 Stephan Ewen <
> > > > > sewen@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi all!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > As many of you
know, there are a ongoing
> efforts
> > to
> > > > > > > > consolidate
> > > > > > > > > > the
> > > > > > > > > > > > > > > streaming API for
the next release, and then
> > > graduate
> > > > > it
> > > > > > > > (from
> > > > > > > > > > beta
> > > > > > > > > > > > > > > status).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In the process
of this consolidation, we want
> to
> > > > > achieve
> > > > > > > the
> > > > > > > > > > > > following
> > > > > > > > > > > > > > > goals.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  - Make the code
more robust and simplify it in
> > > parts
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  - Clearly define
the semantics of the
> > constructs.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  - Prepare it for
support of more advanced
> > > concepts,
> > > > > like
> > > > > > > > > > > > partitionable
> > > > > > > > > > > > > > > state, and event
time.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  - Cut support
for certain corner cases that
> were
> > > > > > > prototyped,
> > > > > > > > > but
> > > > > > > > > > > > > turned
> > > > > > > > > > > > > > > out to be not efficiently
doable
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Based on prior
discussions on the mailing list,
> > > > > Aljoscha
> > > > > > > and
> > > > > > > > me
> > > > > > > > > > > > drafted
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > design documents
below, which outline how the
> > > > > > consolidated
> > > > > > > > API
> > > > > > > > > > > would
> > > > > > > > > > > > > > like.
> > > > > > > > > > > > > > > We focused in constructs,
time, and window
> > > semantics.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Design document
on how to restructure the
> > Streaming
> > > > > API:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Design document
on definitions of time, order,
> > and
> > > > the
> > > > > > > > > resulting
> > > > > > > > > > > > > > semantics:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Note: The design
of the interfaces and concepts
> > for
> > > > > > > advanced
> > > > > > > > > > state
> > > > > > > > > > > in
> > > > > > > > > > > > > > > functions is not
in here. That is part of a
> > > separate
> > > > > > design
> > > > > > > > > > > > discussion
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > orthogonal to the
designs drafted here.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Please have a look
and voice questions and
> > > concerns.
> > > > > > Since
> > > > > > > we
> > > > > > > > > > > should
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > > break the streaming
API more than once, we
> should
> > > > make
> > > > > > sure
> > > > > > > > > this
> > > > > > > > > > > > > > > consolidation brings
it into the shape we want
> it
> > > to
> > > > be
> > > > > > in.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Greetings,
> > > > > > > > > > > > > > > Stephan
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message