flink-dev mailing list archives

From Márton Balassi <balassi.mar...@gmail.com>
Subject Re: Design documents for consolidated DataStream API
Date Wed, 15 Jul 2015 06:26:33 GMT
Ok, thanks for the clarification. Let us try to document it in a way that
reflects those thoughts, then. Discretization will not happen upfront, so
we can wait with that.

On Tue, Jul 14, 2015 at 12:49 PM, Stephan Ewen <sewen@apache.org> wrote:

> There is no inconsistency between the Batch and Streaming API. They have
> different semantics - the batch API is implicitly always windowed.
>
> There is a naming difference between the two APIs.
>
> There is a strong inconsistency within the Streaming API right now.
> Grouping and aggregating without windows is plain dangerous in streaming.
> It either blows up or is undefined in its behavior.
>
>
>
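To make the distinction discussed in this thread concrete, here is a minimal sketch in plain Python. It is not the Flink API; the function names and the (key, value, timestamp) record layout are assumptions made up for illustration. It contrasts a per-key reduce without windows, where key-scoped state lives forever and every record emits an updated value, with a per-key reduce inside tumbling windows, where each group is closed and its state released once the window ends. The windowed variant assumes records arrive in timestamp order.

```python
from collections import defaultdict


def keyed_running_sum(events):
    """Per-key reduce without windows.

    Emits an updated sum for every record. The state holds one entry per
    key ever seen and is never released.
    """
    state = defaultdict(int)
    out = []
    for key, value, _ts in events:
        state[key] += value
        out.append((key, state[key]))
    return out, len(state)


def keyed_tumbling_window_sum(events, size):
    """Per-key reduce inside tumbling windows of `size` time units.

    Assumes events arrive in timestamp order. When a record belonging to a
    later window arrives, the current window is emitted and its state cleared.
    """
    out = []
    current_window = None
    state = defaultdict(int)
    for key, value, ts in events:
        window = ts // size
        if current_window is not None and window != current_window:
            out.extend((current_window, k, v) for k, v in sorted(state.items()))
            state.clear()
        current_window = window
        state[key] += value
    if state:
        # Flush the last (still open) window at end of input.
        out.extend((current_window, k, v) for k, v in sorted(state.items()))
    return out


# Records are (key, value, timestamp).
events = [("a", 1, 0), ("b", 2, 1), ("a", 3, 2), ("a", 4, 5), ("b", 5, 6)]
running, live_keys = keyed_running_sum(events)
windowed = keyed_tumbling_window_sum(events, size=5)
```

In the unwindowed variant there is never a point at which a group is complete, which is one way to read the remark that grouping in streams is not defined without windows.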
> On Tue, Jul 14, 2015 at 12:40 PM, Gyula Fóra <gyula.fora@gmail.com> wrote:
>
> > I see your point, reduceByKey is much clearer.
> >
> > The question is whether we want to introduce this inconsistency across the
> > two APIs or stick with what we have.
> > On Tue, Jul 14, 2015 at 10:57 AM Aljoscha Krettek <aljoscha@apache.org>
> > wrote:
> >
> > > I agree, the groupBy in the batch API is misleading, since a
> > > ds.groupBy().reduce() does not really build any groups; it is really a
> > > ds.keyBy().reduceByKey(). In the streaming API we can still fix this, IMHO.
> > >
> > > On Tue, 14 Jul 2015 at 10:56 Stephan Ewen <sewen@apache.org> wrote:
> > >
> > > > > It is a bit different from the batch API, because streaming semantics
> > > > > are a bit different ;-)
> > > >
> > > > One good thing is that we can make things better that were
> sub-optimal
> > in
> > > > the Batch API.
> > > >
> > > > On Tue, Jul 14, 2015 at 10:55 AM, Stephan Ewen <sewen@apache.org>
> > wrote:
> > > >
> > > > > > keyBy() does not do any grouping. Grouping in streams is not defined
> > > > > > without windows.
> > > > >
> > > > > On Tue, Jul 14, 2015 at 10:48 AM, Gyula Fóra <gyula.fora@gmail.com
> >
> > > > wrote:
> > > > >
> > > > >> If we only want to have either keyBy or groupBy, why not keep
> > groupBy?
> > > > >> That
> > > > >> would be more consistent with the batch api.
> > > > >> On Tue, Jul 14, 2015 at 10:35 AM Stephan Ewen <sewen@apache.org>
> > > wrote:
> > > > >>
> > > > >> > Concerning your comments:
> > > > >> >
> > > > >> > 1) In the new design, there is no grouping without windowing.
> The
> > > > >> > KeyedDataStream subsumes the grouping and key-ing for
> partitioned
> > > > state.
> > > > >> >
> > > > >> >     The keyBy() + window() makes a parallel grouped window
> > > > >> >     keyBy() alone allows access to partitioned state.
> > > > >> >
> > > > > >> >     My thought was that this is simpler, because it does not need both
> > > > > >> > groupBy() and keyBy(), but one construct to handle both cases.
> > > > >> >
> > > > > >> > 2) The discretization is a rough thought and is nothing for the short term.
> > > > > >> > It definitely needs more thought. I put it there as a sketch for
> > > > > >> > how to evolve this.
> > > > >> >
> > > > > >> >     The idea is of course to not have a single data set, but a series of
> > > > > >> > data sets. In each discrete time slice, the data set can be treated like a
> > > > > >> > regular data set.
> > > > >> >
> > > > > >> >     Let's kick off a separate design for the discretization. Joins are good
> > > > > >> > to talk about (data sets can be joined with data sets), and I am sure there
> > > > > >> > are more questions coming up.
> > > > >> >
> > > > >> >
> > > > >> > Does that make sense?
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Jul 14, 2015 at 10:05 AM, Gyula Fóra <
> > gyula.fora@gmail.com>
> > > > >> wrote:
> > > > >> >
> > > > >> > > I think Marton has some good points here.
> > > > >> > >
> > > > >> > > 1) Is KeyedDataStream a better name if this is only
a
> renaming?
> > > > >> > >
> > > > > >> > > 2) The discretize semantics are unclear indeed. Are we operating on a
> > > > > >> > > single data set or a sequence of data sets? If the latter, why not call it
> > > > > >> > > something else (dstream)? How are joins and other binary operators defined
> > > > > >> > > for different discretizations, etc.?
> > > > >> > > On Mon, Jul 13, 2015 at 7:37 PM Márton Balassi <
> > > mbalassi@apache.org
> > > > >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Generally I agree with the new design. Two concerns:
> > > > >> > > >
> > > > >> > > > 1) Does KeyedDataStream replace GroupedDataStream
or is it
> the
> > > > >> latter a
> > > > >> > > > special case of the former?
> > > > >> > > >
> > > > >> > > > The KeyedDataStream as described in the design
document is a
> > bit
> > > > >> > unclear
> > > > >> > > > for me. It lists the following usages:
> > > > >> > > >   a) It is the first step in building a window
stream, on
> top
> > of
> > > > >> which
> > > > >> > > the
> > > > >> > > > grouped/windowed aggregation and reduce-style
function can
> be
> > > > >> applied
> > > > >> > > >   b) It allows to use the "by-key" state of functions.
Here,
> > > every
> > > > >> > record
> > > > >> > > > has access to a state that is scoped by its key.
Key-scoped
> > > state
> > > > >> can
> > > > >> > be
> > > > >> > > > automatically redistributed and repartitioned.
> > > > >> > > >
> > > > >> > > > The code snippet describes a use case where the
computation
> > and
> > > > the
> > > > >> > > access
> > > > >> > > > of the state is used the way currently the GroupedDataStream
> > > > should
> > > > >> > > work. I
> > > > >> > > > suppose this is the example for case b). Would
case a) also
> > > window
> > > > >> > > elements
> > > > >> > > > by key? If yes, then this is practically a renaming
and
> > > > enhancement
> > > > >> of
> > > > >> > > the
> > > > >> > > > GroupedDataStream functionality with keyed state.
Then the
> > > > >> > > > StreamExecutionEnvironment.createKeyedStream(Partitioner,
> > > > >> > > > KeySelector)construction does not make much sense
as the
> user
> > > only
> > > > >> > > operates
> > > > >> > > > within the scope of the keyselector and not the
partitioner
> > > > anyway.
> > > > >> > > >
> > > > >> > > > I personally think KeyedDataStream as a name does
not
> > > necessarily
> > > > >> > suggest
> > > > >> > > > that the records are grouped by key, it only suggests
> > > partitioning
> > > > >> by
> > > > >> > > key -
> > > > >> > > > at least for me. :)
> > > > >> > > >
> > > > >> > > > 2) The API for discretization is not convenient
IMHO
> > > > >> > > >
> > > > >> > > > The discretization part declares that the output
of
> > > > >> > > DataStream.discretize()
> > > > >> > > > is a sequence of DataSets. I love this approach,
but then in
> > the
> > > > >> code
> > > > >> > > > snippet the return value of this function is simply
a
> DataSet
> > > and
> > > > >> uses
> > > > >> > it
> > > > >> > > > as such. The take home message of that code is
the
> following:
> > > this
> > > > >> is
> > > > >> > > > actually the way you would like to program on
these sequence
> > of
> > > > >> > DataSets,
> > > > >> > > > most probably you would like to do the same with
each of
> them.
> > > If
> > > > >> that
> > > > >> > is
> > > > >> > > > the case we should provide a nice utility for
that. I think
> > > Spark
> > > > >> > > > Streaming's DStream.foreachRDD() is fairly useful
for this
> > > > purpose.
> > > > >> > > >
> > > > >> > > > On Mon, Jul 13, 2015 at 6:30 PM, Gyula Fóra <
> > > gyula.fora@gmail.com
> > > > >
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > > +1
> > > > >> > > > > On Mon, Jul 13, 2015 at 6:23 PM Stephan Ewen
<
> > > sewen@apache.org>
> > > > >> > wrote:
> > > > >> > > > >
> > > > >> > > > > > If naming is the only concern, then
we should go ahead,
> > > > because
> > > > >> we
> > > > >> > > can
> > > > >> > > > > > change names easily (before the release).
> > > > >> > > > > >
> > > > >> > > > > > In fact, I don't think it leaves a bad
impression.
> Global
> > > > >> windows
> > > > >> > are
> > > > >> > > > > > non-parallel windows. There are also
parallel windows.
> > Pick
> > > > what
> > > > >> > you
> > > > >> > > > need
> > > > >> > > > > > and what works.
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > On Mon, Jul 13, 2015 at 6:13 PM, Gyula
Fóra <
> > > > >> gyula.fora@gmail.com>
> > > > >> > > > > wrote:
> > > > >> > > > > >
> > > > > >> > > > > > > I think we agree on everything; it's more of a naming issue :)
> > > > > >> > > > > > >
> > > > > >> > > > > > > I thought it might be misleading that global time windows are
> > > > > >> > > > > > > "non-parallel" windows. We don't want to give a bad impression. (Also, we
> > > > > >> > > > > > > don't want them to think that every global window is parallel, but that's
> > > > > >> > > > > > > not a problem here)
> > > > >> > > > > > >
> > > > >> > > > > > > Gyula
> > > > >> > > > > > > On Mon, Jul 13, 2015 at 5:22 PM
Stephan Ewen <
> > > > >> sewen@apache.org>
> > > > >> > > > wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > > Okay, what is missing about
the windowing in your
> > > opinion?
> > > > >> > > > > > > >
> > > > >> > > > > > > > The core points of the document
are:
> > > > >> > > > > > > >
> > > > >> > > > > > > >   - The parallel windows are
per group only.
> > > > >> > > > > > > >
> > > > >> > > > > > > >   - The implementation of
the parallel windows holds
> > > > window
> > > > >> > data
> > > > >> > > in
> > > > >> > > > > the
> > > > >> > > > > > > > group buffers.
> > > > >> > > > > > > >
> > > > >> > > > > > > >   - The global windows are
non-parallel. May have
> > > parallel
> > > > >> > > > > > > pre-aggregation,
> > > > >> > > > > > > > if they are time windows.
> > > > >> > > > > > > >
> > > > >> > > > > > > >   - Time may be operator time
(timer thread), or
> > > watermark
> > > > >> > time.
> > > > >> > > > > > > Watermark
> > > > >> > > > > > > > time can refer to ingress
or event time.
> > > > >> > > > > > > >
> > > > >> > > > > > > >   - Windows that do not pre-aggregate
may require
> > > elements
> > > > >> in
> > > > >> > > > order.
> > > > >> > > > > > Not
> > > > >> > > > > > > > part of the first prototype.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Do we agree on those points?
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Mon, Jul 13, 2015 at 4:50
PM, Gyula Fóra <
> > > > >> > > gyula.fora@gmail.com>
> > > > >> > > > > > > wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > > In general I like it,
although the main difference
> > > > between
> > > > >> > the
> > > > >> > > > > > current
> > > > >> > > > > > > > and
> > > > >> > > > > > > > > the new one is the windowing
and that is still not
> > > very
> > > > >> > clear.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Where do we have the
full stream time windows for
> > > > >> > > instance?(which
> > > > >> > > > > is
> > > > >> > > > > > > > > parallel but not keyed)
> > > > >> > > > > > > > > On Mon, Jul 13, 2015
at 4:28 PM Aljoscha Krettek <
> > > > >> > > > > > aljoscha@apache.org>
> > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > > +1 I like it as
well.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > On Mon, 13 Jul 2015
at 16:17 Kostas Tzoumas <
> > > > >> > > > ktzoumas@apache.org
> > > > >> > > > > >
> > > > >> > > > > > > > wrote:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > > +1 from my
side
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > On Mon, Jul
13, 2015 at 4:15 PM, Stephan Ewen
> <
> > > > >> > > > > sewen@apache.org>
> > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > > Do we
have consensus on these designs?
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > If we
have, we should get to implementing
> this
> > > > soon,
> > > > >> > > > because
> > > > >> > > > > > > > > basically
> > > > >> > > > > > > > > > > all
> > > > >> > > > > > > > > > > > streaming
patches will have to be revisited
> in
> > > > >> light of
> > > > >> > > > > this...
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > On Tue,
Jul 7, 2015 at 3:41 PM, Gyula Fóra <
> > > > >> > > > > > gyula.fora@gmail.com
> > > > >> > > > > > > >
> > > > >> > > > > > > > > > wrote:
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > You
are right thats an important issue.
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > And
I think we should also do some
> renaming
> > > with
> > > > >> the
> > > > >> > > > > > > "iterations"
> > > > >> > > > > > > > > > > because
> > > > >> > > > > > > > > > > > > they
are not really iterations like in the
> > > batch
> > > > >> case
> > > > >> > > and
> > > > >> > > > > it
> > > > >> > > > > > > > might
> > > > >> > > > > > > > > > > > confuse
> > > > >> > > > > > > > > > > > > some
users.
> > > > >> > > > > > > > > > > > > Maybe
we can call them loops or cycles and
> > > > rename
> > > > >> the
> > > > >> > > api
> > > > >> > > > > > calls
> > > > >> > > > > > > > to
> > > > >> > > > > > > > > > make
> > > > >> > > > > > > > > > > > it
> > > > >> > > > > > > > > > > > > more
intuitive what happens. It is really
> > > just a
> > > > >> > cyclic
> > > > >> > > > > > > dataflow.
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > Aljoscha
Krettek <aljoscha@apache.org>
> ezt
> > > írta
> > > > >> > > > (időpont:
> > > > >> > > > > > > 2015.
> > > > >> > > > > > > > > júl.
> > > > >> > > > > > > > > > > 7.,
> > > > >> > > > > > > > > > > > > K,
> > > > >> > > > > > > > > > > > > 15:35):
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > >
Hi,
> > > > >> > > > > > > > > > > > > >
I just noticed that we don't have
> anything
> > > > about
> > > > >> > how
> > > > >> > > > > > > iterations
> > > > >> > > > > > > > > and
> > > > >> > > > > > > > > > > > > >
timestamps/watermarks should interact.
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > >
Cheers,
> > > > >> > > > > > > > > > > > > >
Aljoscha
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > >
On Mon, 6 Jul 2015 at 23:56 Stephan
> Ewen <
> > > > >> > > > > sewen@apache.org
> > > > >> > > > > > >
> > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > >
> Hi all!
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
> As many of you know, there are ongoing efforts to consolidate the
> > > > > >> > > > > > > > > > > > > > >
> streaming API for the next release, and then graduate it (from beta
> > > > > >> > > > > > > > > > > > > > >
> status).
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
> In the process of this consolidation,
> we
> > > > want
> > > > >> to
> > > > >> > > > > achieve
> > > > >> > > > > > > the
> > > > >> > > > > > > > > > > > following
> > > > >> > > > > > > > > > > > > >
> goals.
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
>  - Make the code more robust and
> > simplify
> > > it
> > > > >> in
> > > > >> > > parts
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
>  - Clearly define the semantics of the
> > > > >> > constructs.
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
>  - Prepare it for support of more
> > advanced
> > > > >> > > concepts,
> > > > >> > > > > like
> > > > >> > > > > > > > > > > > partitionable
> > > > >> > > > > > > > > > > > > >
> state, and event time.
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
>  - Cut support for certain corner
> cases
> > > that
> > > > >> were
> > > > >> > > > > > > prototyped,
> > > > >> > > > > > > > > but
> > > > >> > > > > > > > > > > > > turned
> > > > >> > > > > > > > > > > > > >
> out to be not efficiently doable
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
> Based on prior discussions on the mailing list, Aljoscha and I drafted the
> > > > > >> > > > > > > > > > > > > > >
> design documents below, which outline how the consolidated API would look.
> > > > > >> > > > > > > > > > > > > > >
> We focused on constructs, time, and window semantics.
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
> Design document on how to restructure the Streaming API:
> > > > > >> > > > > > > > > > > > > > >
> https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
> Design document on definitions of time, order, and the resulting semantics:
> > > > > >> > > > > > > > > > > > > > >
> https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
> Note: The design of the interfaces and
> > > > >> concepts
> > > > >> > for
> > > > >> > > > > > > advanced
> > > > >> > > > > > > > > > state
> > > > >> > > > > > > > > > > in
> > > > >> > > > > > > > > > > > > >
> functions is not in here. That is part
> > of
> > > a
> > > > >> > > separate
> > > > >> > > > > > design
> > > > >> > > > > > > > > > > > discussion
> > > > >> > > > > > > > > > > > > >
and
> > > > >> > > > > > > > > > > > > >
> orthogonal to the designs drafted
> here.
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
> Please have a look and voice questions
> > and
> > > > >> > > concerns.
> > > > >> > > > > > Since
> > > > >> > > > > > > we
> > > > >> > > > > > > > > > > should
> > > > >> > > > > > > > > > > > > not
> > > > >> > > > > > > > > > > > > >
> break the streaming API more than
> once,
> > we
> > > > >> should
> > > > >> > > > make
> > > > >> > > > > > sure
> > > > >> > > > > > > > > this
> > > > >> > > > > > > > > > > > > >
> consolidation brings it into the shape
> > we
> > > > >> want it
> > > > >> > > to
> > > > >> > > > be
> > > > >> > > > > > in.
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
> Greetings,
> > > > >> > > > > > > > > > > > > >
> Stephan
> > > > >> > > > > > > > > > > > > >
>
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
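As a rough illustration of the discretization idea raised earlier in the thread (a stream treated as a series of data sets, one per time slice, with the same batch-style function applied to each), here is a minimal plain-Python sketch. It is not Flink or Spark code; `discretize`, `for_each_slice`, and the (timestamp, value) record layout are hypothetical names chosen for this example, and the input is assumed to be ordered by timestamp.

```python
from itertools import groupby


def discretize(events, slice_size):
    """Yield (slice_index, records) pairs, one per non-empty time slice.

    Assumes events are (timestamp, value) pairs in timestamp order.
    """
    for idx, group in groupby(events, key=lambda e: e[0] // slice_size):
        yield idx, [value for _ts, value in group]


def for_each_slice(slices, fn):
    """Apply the same batch-style function to every slice's data set."""
    return [(idx, fn(records)) for idx, records in slices]


events = [(0, 3), (1, 4), (5, 10), (6, 1), (12, 7)]
totals = for_each_slice(discretize(events, slice_size=5), sum)
```

Each slice can be handled like a regular data set here; how binary operators such as joins should behave across different discretizations is exactly the open question the thread leaves for a separate design.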
