flink-dev mailing list archives

From Martin Neumann <mneum...@sics.se>
Subject Re: Some thoughts about the lower-level Flink APIs
Date Wed, 17 Aug 2016 16:16:08 GMT
I agree with Vasia that for data scientists it's likely easier to learn the
high-level API. I like the material from
http://dataartisans.github.io/flink-training/ but all of it focuses on the
high-level API.

Maybe we could have a guide (blog post, lecture, whatever) on how to get
into Flink as a Storm user. That would allow people with that background to
dive directly into the lower-level API, working with models similar to what
they are used to. I would volunteer, but I'm not familiar with Storm.

For my part, I have always had to use some lower-level API in all of my
applications, most of the time pestering Aljoscha about the details. So
either I'm the exception or there is a need for more complex examples
showcasing the lower-level API methods.
One of the things I have been using in several pipelines so far is
extracting the start and end timestamps from a window and adding them to
the window aggregate. Maybe something like this could be a useful example
to include in the training.
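That window-timestamp pattern could be sketched like this in plain Java, with no Flink dependency: the `TimeWindow` class and the `apply()` method below are illustrative stand-ins for Flink's `TimeWindow.getStart()`/`getEnd()` accessors and a `WindowFunction`'s `apply()`, so all names here are assumptions for the sketch, not the real API.

```java
import java.util.List;

// Stand-in for Flink's TimeWindow (only the two accessors we need here).
class TimeWindow {
    private final long start, end;
    TimeWindow(long start, long end) { this.start = start; this.end = end; }
    long getStart() { return start; }
    long getEnd() { return end; }
}

// A window aggregate enriched with the boundaries it was computed over.
class StampedSum {
    final long windowStart, windowEnd, sum;
    StampedSum(long windowStart, long windowEnd, long sum) {
        this.windowStart = windowStart;
        this.windowEnd = windowEnd;
        this.sum = sum;
    }
}

class WindowStamping {
    // Emulates what a window function would do: fold the window's elements
    // and attach the window's start/end timestamps to the result.
    static StampedSum apply(TimeWindow window, List<Long> elements) {
        long sum = 0;
        for (long e : elements) sum += e;
        return new StampedSum(window.getStart(), window.getEnd(), sum);
    }
}
```

In actual Flink code the same enrichment would happen inside the window function, where the window argument exposes its start and end timestamps.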

Side question:
I assume there are recurring design patterns in the stream applications
users develop. Is there any chance we will be able to identify or create
some design patterns (similar to the Java design patterns)? That would make
it easier to use the lower-level API and might help people avoid pitfalls
like the one Aljoscha mentioned.

Cheers, Martin
PS: I hope it's fine for me to butt into the discussion like this.

On Tue, Aug 16, 2016 at 4:34 PM, Wright, Eron <ewright@live.com> wrote:

> Jamie,
> I think you raise a valid concern but I would hesitate to accept the
> suggestion that the low-level API be promoted to app developers.
>
> Higher-level abstractions tend to be more constrained and more optimized,
> whereas lower-level abstractions tend to be more powerful, be more
> laborious to use and provide the system with less knowledge.   It is a
> classic tradeoff.
>
> I think it is important to consider what the important/distinguishing
> characteristics of the Flink framework are: exactly-once guarantees,
> event-time support, support for job upgrade without data loss, fault
> tolerance, etc.  I'm speculating that the high-level abstraction provided
> to app developers is probably needed to retain those characteristics.
>
> I think Vasia makes a good point that SQL might be a good alternative way
> to ease into Flink.
>
> Finally, I believe the low-level API is primarily intended for extension
> purposes (connectors, operations, etc) not app development.    It could use
> better documentation to ensure that third-party extensions support those
> key characteristics.
>
> -Eron
>
> > On Aug 16, 2016, at 3:12 AM, Vasiliki Kalavri <vasilikikalavri@gmail.com>
> > wrote:
> >
> > Hi Jamie,
> >
> > thanks for sharing your thoughts on this! You're raising some interesting
> > points.
> >
> > Whether users find the lower-level primitives more intuitive depends on
> > their background I believe. From what I've seen, if users are coming from
> > the S4/Storm world and are used to the "compositional" way of streaming,
> > then indeed it's easier for them to think and operate on that level. These
> > are usually people who have seen/built streaming things before trying out
> > Flink.
> > But if we're talking about analysts and people coming from the "batch" way
> > of thinking or people used to working with SQL/python, then the
> > higher-level declarative API is probably easier to understand.
> >
> > I do think that we should make the lower-level API more visible and
> > document it properly, but I'm not sure if we should teach Flink on this
> > level first. I think that presenting it as a set of "advanced" features
> > makes more sense actually.
> >
> > Cheers,
> > -Vasia.
> >
> > On 16 August 2016 at 04:24, Jamie Grier <jamie@data-artisans.com> wrote:
> >
> >> You lost me at lattice, Aljoscha ;)
> >>
> >> I do think something like the more powerful N-way FlatMap w/ Timers
> >> Aljoscha is describing here would probably solve most of the problem.
> >> Often Flink's higher-level primitives work well for people and that's
> >> great.  It's just that I also spend a fair amount of time discussing with
> >> people how to map what they know they want to do onto operations that
> >> aren't a perfect fit, and it sometimes liberates them when they realize
> >> they can just implement it the way they want by dropping down a level.
> >> They usually don't go there themselves, though.
> >>
> >> I mention teaching this "first" and then the higher layers I guess
> >> because that's just a matter of teaching philosophy.  I think it's good
> >> to see the basic operations that are available first and then understand
> >> that the other abstractions are built on top of that.  That way you're
> >> not afraid to drop down to basics when you know what you want to get done.
> >>
> >> -Jamie
> >>
> >>
> >> On Mon, Aug 15, 2016 at 2:11 AM, Aljoscha Krettek <aljoscha@apache.org>
> >> wrote:
> >>
> >>> Hi All,
> >>> I also thought about this recently. A good thing to add would be a
> >>> user-facing operator that behaves more or less like an enhanced FlatMap
> >>> with multiple inputs, multiple outputs, state access and keyed timers. I'm
> >>> a bit hesitant, though, since users rarely think about the implications
> >>> that come with state updates and out-of-order events. If you don't
> >>> implement a stateful operator correctly you get pretty much arbitrary
> >>> results.
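Purely as an illustration of the kind of operator Aljoscha describes, a user-facing surface might look roughly like the sketch below. This is an invented sketch, not an actual Flink interface; every name in it (`EnhancedFlatMap`, `ValueStore`, `processElement`, `onTimer`) is hypothetical.

```java
import java.util.function.Consumer;

// In-memory stand-in for fault-tolerant keyed state (hypothetical).
class ValueStore<T> {
    private T value;
    T get() { return value; }
    void update(T v) { value = v; }
}

// Hypothetical sketch of the "enhanced FlatMap" idea: per-element
// processing with access to keyed state and timer callbacks.
interface EnhancedFlatMap<IN, OUT, S> {
    // Called once per input element; may read/update state and emit results.
    void processElement(IN element, ValueStore<S> state, Consumer<OUT> out);

    // Called when a previously registered timer fires.
    void onTimer(long timestamp, ValueStore<S> state, Consumer<OUT> out);
}

// Example use: count elements per key and emit the running count.
class CountingFlatMap implements EnhancedFlatMap<String, Long, Long> {
    @Override
    public void processElement(String element, ValueStore<Long> state,
                               Consumer<Long> out) {
        long count = state.get() == null ? 1 : state.get() + 1;
        state.update(count);
        out.accept(count);
    }

    @Override
    public void onTimer(long timestamp, ValueStore<Long> state,
                        Consumer<Long> out) {
        // e.g. flush or clear state when a timer fires; no-op here.
    }
}
```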
> >>>
> >>> The problem with out-of-order event arrival and state update is that the
> >>> state basically has to monotonically transition "upwards" through a lattice
> >>> for the computation to make sense. I know this sounds rather theoretical
> >>> so I'll try to explain with an example. Say you have an operator that
> >>> waits for timestamped elements A, B, C to arrive in timestamp order and
> >>> then does some processing. The naive approach would be to have a small
> >>> state machine that tracks what element you have seen so far. The state
> >>> machine has four states: NOTHING, SEEN_A, SEEN_B and SEEN_C. The state
> >>> machine is supposed to traverse these states linearly as the elements
> >>> arrive. This doesn't work, however, when elements arrive in an order that
> >>> does not match their timestamp order. What the user should do is to have
> >>> a "Set" state that keeps track of the elements that it has seen. Once it
> >>> has seen {A, B, C} the operator must check the timestamps and then do the
> >>> processing, if required. The set of possible combinations of A, B, and C
> >>> forms a lattice when combined with the "subset" operation. And traversal
> >>> through these sets is monotonically "upwards" so it works regardless of
> >>> the order that the elements arrive in. (I recently pointed this out on
> >>> the Beam mailing list and Kenneth Knowles rightly pointed out that what
> >>> I was describing was in fact a lattice.)
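The two approaches can be contrasted in a few lines of plain Java (a sketch independent of any Flink state API; the class and method names are invented for illustration): the set below only ever grows, i.e. it moves monotonically "upwards" through the subset lattice, so the outcome is the same for every arrival order.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Order-insensitive tracker: instead of a linear state machine
// (NOTHING -> SEEN_A -> SEEN_B -> SEEN_C), keep the set of elements seen
// so far plus their timestamps. The set only grows, so arrival order is
// irrelevant to the final result.
class SeenSetTracker {
    private final Set<String> seen = new HashSet<>();
    private final Map<String, Long> timestamps = new HashMap<>();

    // Record an element; returns true once all of A, B, C have been seen
    // AND their timestamps are in order, i.e. the processing may fire.
    boolean onElement(String element, long timestamp) {
        seen.add(element);
        timestamps.put(element, timestamp);
        return seen.containsAll(Set.of("A", "B", "C"))
                && timestamps.get("A") <= timestamps.get("B")
                && timestamps.get("B") <= timestamps.get("C");
    }
}
```

A naive SEEN_A/SEEN_B/SEEN_C state machine fed the arrival order C, A, B would stay stuck in NOTHING; the tracker above fires correctly once B, the last missing element, arrives.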
> >>>
> >>> I know this is a bit off-topic but I think it's very easy for users to
> >>> write wrong operations when they are dealing with state. We should still
> >>> have a good API for it, though. Just wanted to make people aware of this.
> >>>
> >>> Cheers,
> >>> Aljoscha
> >>>
> >>> On Mon, 15 Aug 2016 at 08:18 Matthias J. Sax <mjsax@apache.org> wrote:
> >>>
> >>>> It really depends on the skill level of the developer. Using the
> >>>> low-level API requires thinking about many details (e.g. state
> >>>> handling etc.) that could be done wrong.
> >>>>
> >>>> As Flink gets a broader community, more people will use it who might not
> >>>> have the required skill level to deal with the low-level API. For more
> >>>> trained users, it is of course a powerful tool!
> >>>>
> >>>> I guess it boils down to the question of what type of developer Flink
> >>>> targets, and whether the low-level API should be aggressively advertised
> >>>> or not. Also keep in mind that many people criticized Storm's low-level
> >>>> API as hard to program etc.
> >>>>
> >>>>
> >>>> -Matthias
> >>>>
> >>>> On 08/15/2016 07:46 AM, Gyula Fóra wrote:
> >>>>> Hi Jamie,
> >>>>>
> >>>>> I agree that it is often much easier to work on the lower-level APIs
> >>>>> if you know what you are doing.
> >>>>>
> >>>>> I think it would be nice to have very clean abstractions on that level
> >>>>> so we could teach this to the users first, but currently I think it's
> >>>>> not easy enough to be a good starting point.
> >>>>>
> >>>>> The user needs to understand a lot about the system if they don't want
> >>>>> to hurt other parts of the pipeline. For instance: working with the
> >>>>> StreamRecords, propagating watermarks, working with state internals.
> >>>>>
> >>>>> This all might be overwhelming at first glance. But maybe we can slim
> >>>>> some abstractions down to the point where this becomes kind of an
> >>>>> extension of the RichFunctions.
> >>>>>
> >>>>> Cheers,
> >>>>> Gyula
> >>>>>
> >>>>> On Sat, Aug 13, 2016, 17:48 Jamie Grier <jamie@data-artisans.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hey all,
> >>>>>>
> >>>>>> I've noticed a few times now when trying to help users implement
> >>>>>> particular things in the Flink API that it can be complicated to map
> >>>>>> what they know they are trying to do onto higher-level Flink concepts
> >>>>>> such as windowing or Connect/CoFlatMap/ValueState, etc.
> >>>>>>
> >>>>>> At some point it just becomes easier to think about writing a Flink
> >>>>>> operator yourself that is integrated into the pipeline with a
> >>>>>> transform() call.
> >>>>>>
> >>>>>> It can just be easier to think at a more basic level.  For example I
> >>>>>> can write an operator that can consume one or two input streams
> >>>>>> (should probably be N), update state which is managed for me
> >>>>>> fault-tolerantly, and output elements or set up timers/triggers that
> >>>>>> give me callbacks from which I can also update state or emit elements.
> >>>>>>
> >>>>>> When you think at this level you realize you can program just about
> >>>>>> anything you want.  You can create whatever fault-tolerant data
> >>>>>> structures you want, and easily execute robust stateful computation
> >>>>>> over data streams at scale.  This is the real technology and power of
> >>>>>> Flink IMO.
> >>>>>>
> >>>>>> Also, at this level I don't have to think about the complexities of
> >>>>>> windowing semantics, learn as much API, etc.  I can easily have some
> >>>>>> inputs that are broadcast, others that are keyed, manage my own state
> >>>>>> in whatever data structure makes sense, etc.  If I know exactly what I
> >>>>>> actually want to do I can just do it with the full power of my chosen
> >>>>>> language, data structures, etc.  I'm not "restricted" to trying to map
> >>>>>> everything onto higher-level Flink constructs, which is sometimes
> >>>>>> actually more complicated.
> >>>>>>
> >>>>>> Programming at this level is actually fairly easy to do but people
> >>>>>> seem a bit afraid of this level of the API.  They think of it as
> >>>>>> low-level or custom hacking.
> >>>>>>
> >>>>>> Anyway, I guess my thought is this: should we explain Flink to people
> >>>>>> at this level *first*?  Show that you have nearly unlimited power and
> >>>>>> flexibility to build what you want *and only then* from there explain
> >>>>>> the higher-level APIs they can use *if* those match their use cases
> >>>>>> well.
> >>>>>>
> >>>>>> Would this better demonstrate to people the power of Flink and maybe
> >>>>>> *liberate* them a bit from feeling they have to map their problem onto
> >>>>>> a more complex set of higher-level primitives?  I see people trying to
> >>>>>> shoe-horn what they are really trying to do, which is simple to
> >>>>>> explain in English, onto windows, triggers, CoFlatMaps, etc., and this
> >>>>>> gets complicated sometimes.  It's like an impedance mismatch.  You
> >>>>>> could just solve the problem very easily in straight Java/Scala.
> >>>>>>
> >>>>>> Anyway, it's very easy to drop down a level in the API and program
> >>>>>> whatever you want but users don't seem to *perceive* it that way.
> >>>>>>
> >>>>>> Just some thoughts...  Any feedback?  Have any of you had similar
> >>>>>> experiences when working with newer Flink users or as a newer Flink
> >>>>>> user yourself?  Can/should we do anything to make the *lower* level
> >>>>>> API more accessible/visible to users?
> >>>>>>
> >>>>>> -Jamie
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >>
> >> Jamie Grier
> >> data Artisans, Director of Applications Engineering
> >> @jamiegrier <https://twitter.com/jamiegrier>
> >> jamie@data-artisans.com
> >>
>
>
