Mailing-List: contact dev-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@flink.apache.org
MIME-Version: 1.0
References: 
 <CANC1h_t75XGobCnksRVpuFYZyjk1xX7yp12o4BSCxoCkQxOKyQ@mail.gmail.com>
 <CANMXwW1Zfs8BfYvzDFHdHOj5Fp56XQy852wkB0TrVQnaw5qtLA@mail.gmail.com>
 <CA+faj9wBDLY_NK6VqngXrHDVkJ-ZDjQfB3Mzp8hUYhdubL+Z8A@mail.gmail.com>
 <CANC1h_suCN6gk6weUB4Yw8EoftyDFLgSbR1TO3C97_se0=unpQ@mail.gmail.com>
 <CAGWx-_sGoVE2rVujWCU-mFPU+_BDP8_RS-KswashOw2ZDhOxBw@mail.gmail.com>
 <CANMXwW25WqifUEA2-LgAQRSOb8QTEEg3uYcrKPos7p6RbBZ4Ug@mail.gmail.com>
 <CA+faj9zDcW=cwN02MEkPegNQ9tmaOYWTi8i4tgabDYUKC=6kdg@mail.gmail.com>
 <CANC1h_unoKE+xKAwC1C+n=TwzW78GVNJd1YCDQmXcTSaQ2M58g@mail.gmail.com>
 <CA+faj9wTAXnwfViL_KYuocSSu6h3BxfLi_=YNwm5L6mGRKzt1Q@mail.gmail.com>
 <CANC1h_tN1K5Bo=Omemg--VA_=3UzjpBx_oPE0ucNh+EXYOW+Zg@mail.gmail.com>
 <CA+faj9xRin7t9DR9Je4tMYBfePurRHYeKc0m8aoCW4FrW_J9TA@mail.gmail.com>
 <CAKADb_O=Gzbda=stn1=X6-xTomao5o4QO3g=e41imsuG4mPBWw@mail.gmail.com>
 <CA+faj9z_8+mS=qMgaAkFw_jEEurGEZ3ZfvC9+GR6uD1dLZ5VLA@mail.gmail.com>
 <CANC1h_v0tNo=G6xnRJ8mPWP1HsCkejb7ieFFB5oGnKB2E_9POQ@mail.gmail.com>
 <CA+faj9yX_8=MCTMjMQdMWOwt_u-T7UZcvbSdh4=S4+ptes0gfA@mail.gmail.com>
 <CANC1h_tW_aCYONf09_uYZKQ90CKEEPrOkYTojXqJ4jCVheBUoA@mail.gmail.com>
 <CANC1h_uwdhZcLh+M_Qfx5FNmOKDySGuTofr0Ut4O-myKcEvBYg@mail.gmail.com>
 <CANMXwW3FS0mRdMzM7Mr-3Y+5sBUJf5UVZmvEDF7BfbunHh6YZQ@mail.gmail.com>
In-Reply-To: 
 <CANMXwW3FS0mRdMzM7Mr-3Y+5sBUJf5UVZmvEDF7BfbunHh6YZQ@mail.gmail.com>
From: =?UTF-8?Q?Gyula_F=C3=B3ra?= <gyula.fora@gmail.com>
Date: Tue, 14 Jul 2015 10:40:09 +0000
Message-ID: 
 <CA+faj9xTH+MvsGQ7o2URwvE26vhzRhzUTbAQpYQyEy=6vghS+A@mail.gmail.com>
Subject: Re: Design documents for consolidated DataStream API
To: "dev@flink.apache.org" <dev@flink.apache.org>
Content-Type: multipart/alternative; boundary=089e0160bdaae498e0051ad37627

--089e0160bdaae498e0051ad37627
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

I see your point, reduceByKey is much clearer.

The question is whether we want to introduce this inconsistency across the
two api-s or stick with what we have.
On Tue, Jul 14, 2015 at 10:57 AM Aljoscha Krettek <aljoscha@apache.org>
wrote:

> I agree, the groupBy, in the batch API is misleading, since a
> ds.groupBy().reduce() does not really build any groups, it is really a
> ds.keyBy().reduceByKey(). In the streaming API we can still fix this, IMH=
O.
>
> On Tue, 14 Jul 2015 at 10:56 Stephan Ewen <sewen@apache.org> wrote:
>
> > It is not a bit different than the batch API, because streaming semanti=
cs
> > are a bit different ;-)
> >
> > One good thing is that we can make things better that were sub-optimal =
in
> > the Batch API.
> >
> > On Tue, Jul 14, 2015 at 10:55 AM, Stephan Ewen <sewen@apache.org> wrote=
:
> >
> > > keyBy() does not do any grouping. Grouping in streams in not defined
> > > without windows.
> > >
> > > On Tue, Jul 14, 2015 at 10:48 AM, Gyula F=C3=B3ra <gyula.fora@gmail.c=
om>
> > wrote:
> > >
> > >> If we only want to have either keyBy or groupBy, why not keep groupB=
y?
> > >> That
> > >> would be more consistent with the batch api.
> > >> On Tue, Jul 14, 2015 at 10:35 AM Stephan Ewen <sewen@apache.org>
> wrote:
> > >>
> > >> > Concerning your comments:
> > >> >
> > >> > 1) In the new design, there is no grouping without windowing. The
> > >> > KeyedDataStream subsumes the grouping and key-ing for partitioned
> > state.
> > >> >
> > >> >     The keyBy() + window() makes a parallel grouped window
> > >> >     keyBy() alone allows access to partitioned state.
> > >> >
> > >> >     My thought was that this is simpler, because it needs not
> > groupBy()
> > >> and
> > >> > keyBy(), but one construct to handle both cases.
> > >> >
> > >> > 2) The discretization is a rough thought and is nothing for the
> short
> > >> term.
> > >> > It totally needs more thoughts. I put it there to have it as a
> sketch
> > >> for
> > >> > how to evolve this.
> > >> >
> > >> >     The idea is of course to not have a single data set, but a
> series
> > of
> > >> > data set. In each discrete time slice, the data set can be treated
> > like
> > >> a
> > >> > regular data set.
> > >> >
> > >> >     Let's kick off a separate design for the discretization. Joins
> are
> > >> good
> > >> > to talk about (data sets can be joined with data set), and I am su=
re
> > >> there
> > >> > are more questions coming up.
> > >> >
> > >> >
> > >> > Does that make sense?
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Tue, Jul 14, 2015 at 10:05 AM, Gyula F=C3=B3ra <gyula.fora@gmai=
l.com>
> > >> wrote:
> > >> >
> > >> > > I think Marton has some good points here.
> > >> > >
> > >> > > 1) Is KeyedDataStream a better name if this is only a renaming?
> > >> > >
> > >> > > 2) the discretize semantics is unclear indeed. Are we operating
> on a
> > >> > single
> > >> > > or sequence of datasets? If the latter why not call it something
> > else
> > >> > > (dstream). How are joins and other binary operators defined for
> > >> different
> > >> > > discretizations etc.
> > >> > > On Mon, Jul 13, 2015 at 7:37 PM M=C3=A1rton Balassi <
> mbalassi@apache.org
> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Generally I agree with the new design. Two concerns:
> > >> > > >
> > >> > > > 1) Does KeyedDataStream replace GroupedDataStream or is it the
> > >> latter a
> > >> > > > special case of the former?
> > >> > > >
> > >> > > > The KeyedDataStream as described in the design document is a b=
it
> > >> > unclear
> > >> > > > for me. It lists the following usages:
> > >> > > >   a) It is the first step in building a window stream, on top =
of
> > >> which
> > >> > > the
> > >> > > > grouped/windowed aggregation and reduce-style function can be
> > >> applied
> > >> > > >   b) It allows to use the "by-key" state of functions. Here,
> every
> > >> > record
> > >> > > > has access to a state that is scoped by its key. Key-scoped
> state
> > >> can
> > >> > be
> > >> > > > automatically redistributed and repartitioned.
> > >> > > >
> > >> > > > The code snippet describes a use case where the computation an=
d
> > the
> > >> > > access
> > >> > > > of the state is used the way currently the GroupedDataStream
> > should
> > >> > > work. I
> > >> > > > suppose this is the example for case b). Would case a) also
> window
> > >> > > elements
> > >> > > > by key? If yes, then this is practically a renaming and
> > enhancement
> > >> of
> > >> > > the
> > >> > > > GroupedDataStream functionality with keyed state. Then the
> > >> > > > StreamExecutionEnvironment.createKeyedStream(Partitioner,
> > >> > > > KeySelector)construction does not make much sense as the user
> only
> > >> > > operates
> > >> > > > within the scope of the keyselector and not the partitioner
> > anyway.
> > >> > > >
> > >> > > > I personally think KeyedDataStream as a name does not
> necessarily
> > >> > suggest
> > >> > > > that the records are grouped by key, it only suggests
> partitioning
> > >> by
> > >> > > key -
> > >> > > > at least for me. :)
> > >> > > >
> > >> > > > 2) The API for discretization is not convenient IMHO
> > >> > > >
> > >> > > > The discretization part declares that the output of
> > >> > > DataStream.discretize()
> > >> > > > is a sequence of DataSets. I love this approach, but then in t=
he
> > >> code
> > >> > > > snippet the return value of this function is simply a DataSet
> and
> > >> uses
> > >> > it
> > >> > > > as such. The take home message of that code is the following:
> this
> > >> is
> > >> > > > actually the way you would like to program on these sequence o=
f
> > >> > DataSets,
> > >> > > > most probably you would like to do the same with each of them.
> If
> > >> that
> > >> > is
> > >> > > > the case we should provide a nice utility for that. I think
> Spark
> > >> > > > Streaming's DStream.foreachRDD() is fairly useful for this
> > purpose.
> > >> > > >
> > >> > > > On Mon, Jul 13, 2015 at 6:30 PM, Gyula F=C3=B3ra <
> gyula.fora@gmail.com
> > >
> > >> > > wrote:
> > >> > > >
> > >> > > > > +1
> > >> > > > > On Mon, Jul 13, 2015 at 6:23 PM Stephan Ewen <
> sewen@apache.org>
> > >> > wrote:
> > >> > > > >
> > >> > > > > > If naming is the only concern, then we should go ahead,
> > because
> > >> we
> > >> > > can
> > >> > > > > > change names easily (before the release).
> > >> > > > > >
> > >> > > > > > In fact, I don't think it leaves a bad impression. Global
> > >> windows
> > >> > are
> > >> > > > > > non-parallel windows. There are also parallel windows. Pic=
k
> > what
> > >> > you
> > >> > > > need
> > >> > > > > > and what works.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Mon, Jul 13, 2015 at 6:13 PM, Gyula F=C3=B3ra <
> > >> gyula.fora@gmail.com>
> > >> > > > > wrote:
> > >> > > > > >
> > >> > > > > > > I think we agree on everything its more of a naming issu=
e
> :)
> > >> > > > > > >
> > >> > > > > > > I thought it might be misleading that global time window=
s
> > are
> > >> > > > > > > "non-parallel" windows. We dont want to give a bad
> > impression.
> > >> > > (Also
> > >> > > > we
> > >> > > > > > > dont want them to think that every global window is
> parallel
> > >> but
> > >> > > > thats
> > >> > > > > > not
> > >> > > > > > > a problem here)
> > >> > > > > > >
> > >> > > > > > > Gyula
> > >> > > > > > > On Mon, Jul 13, 2015 at 5:22 PM Stephan Ewen <
> > >> sewen@apache.org>
> > >> > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Okay, what is missing about the windowing in your
> opinion?
> > >> > > > > > > >
> > >> > > > > > > > The core points of the document are:
> > >> > > > > > > >
> > >> > > > > > > >   - The parallel windows are per group only.
> > >> > > > > > > >
> > >> > > > > > > >   - The implementation of the parallel windows holds
> > window
> > >> > data
> > >> > > in
> > >> > > > > the
> > >> > > > > > > > group buffers.
> > >> > > > > > > >
> > >> > > > > > > >   - The global windows are non-parallel. May have
> parallel
> > >> > > > > > > pre-aggregation,
> > >> > > > > > > > if they are time windows.
> > >> > > > > > > >
> > >> > > > > > > >   - Time may be operator time (timer thread), or
> watermark
> > >> > time.
> > >> > > > > > > Watermark
> > >> > > > > > > > time can refer to ingress or event time.
> > >> > > > > > > >
> > >> > > > > > > >   - Windows that do not pre-aggregate may require
> elements
> > >> in
> > >> > > > order.
> > >> > > > > > Not
> > >> > > > > > > > part of the first prototype.
> > >> > > > > > > >
> > >> > > > > > > > Do we agree on those points?
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On Mon, Jul 13, 2015 at 4:50 PM, Gyula F=C3=B3ra <
> > >> > > gyula.fora@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > In general I like it, although the main difference
> > between
> > >> > the
> > >> > > > > > current
> > >> > > > > > > > and
> > >> > > > > > > > > the new one is the windowing and that is still not
> very
> > >> > clear.
> > >> > > > > > > > >
> > >> > > > > > > > > Where do we have the full stream time windows for
> > >> > > instance?(which
> > >> > > > > is
> > >> > > > > > > > > parallel but not keyed)
> > >> > > > > > > > > On Mon, Jul 13, 2015 at 4:28 PM Aljoscha Krettek <
> > >> > > > > > aljoscha@apache.org>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > +1 I like it as well.
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Mon, 13 Jul 2015 at 16:17 Kostas Tzoumas <
> > >> > > > ktzoumas@apache.org
> > >> > > > > >
> > >> > > > > > > > wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > +1 from my side
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > On Mon, Jul 13, 2015 at 4:15 PM, Stephan Ewen <
> > >> > > > > sewen@apache.org>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > > Do we have consensus on these designs?
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > If we have, we should get to implementing this
> > soon,
> > >> > > > because
> > >> > > > > > > > > basically
> > >> > > > > > > > > > > all
> > >> > > > > > > > > > > > streaming patches will have to be revisited in
> > >> light of
> > >> > > > > this...
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > On Tue, Jul 7, 2015 at 3:41 PM, Gyula F=C3=B3r=
a <
> > >> > > > > > gyula.fora@gmail.com
> > >> > > > > > > >
> > >> > > > > > > > > > wrote:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > > You are right thats an important issue.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > And I think we should also do some renaming
> with
> > >> the
> > >> > > > > > > "iterations"
> > >> > > > > > > > > > > because
> > >> > > > > > > > > > > > > they are not really iterations like in the
> batch
> > >> case
> > >> > > and
> > >> > > > > it
> > >> > > > > > > > might
> > >> > > > > > > > > > > > confuse
> > >> > > > > > > > > > > > > some users.
> > >> > > > > > > > > > > > > Maybe we can call them loops or cycles and
> > rename
> > >> the
> > >> > > api
> > >> > > > > > calls
> > >> > > > > > > > to
> > >> > > > > > > > > > make
> > >> > > > > > > > > > > > it
> > >> > > > > > > > > > > > > more intuitive what happens. It is really
> just a
> > >> > cyclic
> > >> > > > > > > dataflow.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Aljoscha Krettek <aljoscha@apache.org> ezt
> =C3=ADrta
> > >> > > > (id=C5=91pont:
> > >> > > > > > > 2015.
> > >> > > > > > > > > j=C3=BAl.
> > >> > > > > > > > > > > 7.,
> > >> > > > > > > > > > > > > K,
> > >> > > > > > > > > > > > > 15:35):
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Hi,
> > >> > > > > > > > > > > > > > I just noticed that we don't have anything
> > about
> > >> > how
> > >> > > > > > > iterations
> > >> > > > > > > > > and
> > >> > > > > > > > > > > > > > timestamps/watermarks should interact.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Cheers,
> > >> > > > > > > > > > > > > > Aljoscha
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > On Mon, 6 Jul 2015 at 23:56 Stephan Ewen <
> > >> > > > > sewen@apache.org
> > >> > > > > > >
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Hi all!
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > As many of you know, there are a ongoing
> > >> efforts
> > >> > to
> > >> > > > > > > > consolidate
> > >> > > > > > > > > > the
> > >> > > > > > > > > > > > > > > streaming API for the next release, and
> then
> > >> > > graduate
> > >> > > > > it
> > >> > > > > > > > (from
> > >> > > > > > > > > > beta
> > >> > > > > > > > > > > > > > > status).
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > In the process of this consolidation, we
> > want
> > >> to
> > >> > > > > achieve
> > >> > > > > > > the
> > >> > > > > > > > > > > > following
> > >> > > > > > > > > > > > > > > goals.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >  - Make the code more robust and simplif=
y
> it
> > >> in
> > >> > > parts
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >  - Clearly define the semantics of the
> > >> > constructs.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >  - Prepare it for support of more advanc=
ed
> > >> > > concepts,
> > >> > > > > like
> > >> > > > > > > > > > > > partitionable
> > >> > > > > > > > > > > > > > > state, and event time.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >  - Cut support for certain corner cases
> that
> > >> were
> > >> > > > > > > prototyped,
> > >> > > > > > > > > but
> > >> > > > > > > > > > > > > turned
> > >> > > > > > > > > > > > > > > out to be not efficiently doable
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Based on prior discussions on the mailin=
g
> > >> list,
> > >> > > > > Aljoscha
> > >> > > > > > > and
> > >> > > > > > > > me
> > >> > > > > > > > > > > > drafted
> > >> > > > > > > > > > > > > > the
> > >> > > > > > > > > > > > > > > design documents below, which outline ho=
w
> > the
> > >> > > > > > consolidated
> > >> > > > > > > > API
> > >> > > > > > > > > > > would
> > >> > > > > > > > > > > > > > like.
> > >> > > > > > > > > > > > > > > We focused in constructs, time, and wind=
ow
> > >> > > semantics.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Design document on how to restructure th=
e
> > >> > Streaming
> > >> > > > > API:
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+=
on+Streams
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Design document on definitions of time,
> > order,
> > >> > and
> > >> > > > the
> > >> > > > > > > > > resulting
> > >> > > > > > > > > > > > > > semantics:
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Strea=
ms
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Note: The design of the interfaces and
> > >> concepts
> > >> > for
> > >> > > > > > > advanced
> > >> > > > > > > > > > state
> > >> > > > > > > > > > > in
> > >> > > > > > > > > > > > > > > functions is not in here. That is part o=
f
> a
> > >> > > separate
> > >> > > > > > design
> > >> > > > > > > > > > > > discussion
> > >> > > > > > > > > > > > > > and
> > >> > > > > > > > > > > > > > > orthogonal to the designs drafted here.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Please have a look and voice questions a=
nd
> > >> > > concerns.
> > >> > > > > > Since
> > >> > > > > > > we
> > >> > > > > > > > > > > should
> > >> > > > > > > > > > > > > not
> > >> > > > > > > > > > > > > > > break the streaming API more than once, =
we
> > >> should
> > >> > > > make
> > >> > > > > > sure
> > >> > > > > > > > > this
> > >> > > > > > > > > > > > > > > consolidation brings it into the shape w=
e
> > >> want it
> > >> > > to
> > >> > > > be
> > >> > > > > > in.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Greetings,
> > >> > > > > > > > > > > > > > > Stephan
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

--089e0160bdaae498e0051ad37627--