flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Akshay Dixit <akshayd...@gmail.com>
Subject Re: [GSoc][flink-streaming] Interested in pursuing FLINK-1617 and FLINK-1534
Date Wed, 25 Mar 2015 22:35:00 GMT
Hi,
The link to the draft proposal that I've prepared is
https://gist.github.com/akshaydixi/88f3fbcebab0119a6a31
It would be great if I could get some feedback on it.
Regards,
Akshay Dixit

On Wed, Mar 25, 2015 at 2:03 AM, Akshay Dixit <akshaydixi@gmail.com> wrote:

> Thanks Gyula.
>
> I agree too that simple and working implementations are preferrable over
> hacky complex solutions. I'll start sketching out an initial straighforward
> API with only basic pattern matching features
> and base it on the existing windowing API. I'll post a draft of the
> proposal,  keeping the points you've said in mind, tomorrow, so you can
> look it over to see if its all right.
> Regards,
> Akshay Dixit
>
> On Tue, Mar 24, 2015 at 6:30 PM, Gyula Fóra <gyfora@apache.org> wrote:
>
>> Hey Dixit,
>>
>> Sorry for the delay, I had to discuss this in more detail with some of our
>> other core developers.
>>
>> The consensus seems to be that we would like push this project in a
>> direction where the changes can be quickly included in the next releases.
>> For this it is essential that we implement features that are complete (and
>> clean) from the users perspective. This does not necessarily mean that we
>> would like to have everything at once but rather that it is preferable to
>> start with something clean and simple (for instance the naive chained
>> filter approach) and progressively build more complex logic.
>>
>> This also mean that we would like to avoid "researchy" code in the
>> codebase
>> as much as possible. Of course once we have a stable api for this
>> functionality we can work towards making the optimizations that you have
>> mentioned like operator sharing and so on.
>>
>> The ideal proposal would give a clear sketch of the pattern matching API
>> that you would like to implement, which might be some added operators at
>> first to the current API and possible a DSL later with more advanced
>> functionality (this would probably go in a separate library until it is
>> very stable).
>>
>> So please in the proposal include a preview of what the pattern matching
>> syntax would look like integrated with the current operators, how it would
>> interact with other parts of the system etc.
>>
>> These are the thing we need to figure out before we consider the
>> optimizations I think, because it usually turns out, that the API
>> semantics
>> you would like to provide can hugely affect (probably limit) the
>> possibilities that you have afterwards in terms of optimizations.
>>
>> Let me know if you have further questions regarding this :)
>>
>> Gyula
>>
>> On Tue, Mar 24, 2015 at 12:01 PM, Gyula Fóra <gyfora@apache.org> wrote:
>>
>> > Hey,
>> >
>> > Give me an hour or so as I am in a meeting currently, but I will get
>> back
>> > to you afterwards.
>> >
>> > Regards,
>> > Gyula
>> >
>> > On Tue, Mar 24, 2015 at 11:03 AM, Akshay Dixit <akshaydixi@gmail.com>
>> > wrote:
>> >
>> >> Hi,
>> >> It'd really help if I got a reply soon. It'll be helpful in writing the
>> >> proposal since the deadline is on 27th. Thanks
>> >> Regards,
>> >> Akshay Dixit
>> >>
>> >> On Sun, Mar 22, 2015 at 1:17 AM, Akshay Dixit <akshaydixi@gmail.com>
>> >> wrote:
>> >>
>> >> > Thanks for the explanation Marton. I've decided to try out for
>> >> FLINK-1534.
>> >> >
>> >> > After reading through the thesis[4] and a few other papers[1][2][3],
>> I
>> >> > believe I've gathered a little context to ask more questions. But I'm
>> >> still
>> >> > not sure how Flink's internals work
>> >> > so please bear with me. Although the ongoing effort to document the
>> >> > architecture and internal is really helpful for newbies like me and
>> >> would
>> >> > greatly decrease the ramping up time.
>> >> >
>> >> > Detecting a pattern of events would comprise of a pipeline that
>> accepts
>> >> > the pattern query and
>> >> > sources of DataStreams, and outputs detected matches of that pattern
>> to
>> >> a
>> >> > sink or forwards it
>> >> > along to another stream for further computation.
>> >> >
>> >> > As you said, a simple filter-join-aggregate query system could be
>> >> > developed implementing using the existing Streaming windowing API.
>> >> > But matching over complex events and decoding their pattern queries
>> >> would
>> >> > require implementing a DSL that transforms queries into an evaluation
>> >> > model. For e.g,
>> >> > in [1], the authors have implemented an NFA automaton with a shared
>> >> > versioned buffer that models the queries. In [4], the authors
>> >> > propose a new language that is much more expressive and compiles
>> into a
>> >> > topology graph for Storm.
>> >> >
>> >> > So in Flink's case, I believe the proposed DSL would generate
>> operator
>> >> > graphs for the Flink compiler to schedule Jobgraphs over
>> TaskManagers.
>> >> > If we don't depend on the Windowing API, would we need to create new
>> >> > operators such as the Projection, Conjunction and Union operators
>> >> defined
>> >> > in [4] ?
>> >> > Also I would like to hear your thoughts on how to approach scaling
>> the
>> >> > pattern matching query. Note all these techniques talk about scaling
>> a
>> >> > single query.
>> >> > I've read various ways such as
>> >> >
>> >> > 1.  Merging equivalent runs[1] -: This seems a good way to squash
>> >> multiple
>> >> > instances of pattern matching forks into a single one if they have
>> the
>> >> same
>> >> > state.
>> >> > But I'm not sure how we would implement this in Flink since this is
a
>> >> > runtime optimization.
>> >> >
>> >> > 2.  Implementing a matched version buffer[1] -: This would involve
>> >> sharing
>> >> > state of a buffer datastructure across multiple candidate match
>> >> instances
>> >> > for the pattern.
>> >> >
>> >> > 3.  Splitting complex composite patterns into simpler sub-patterns[4]
>> >> and
>> >> > executing separate queries to detect those sub-patterns. This might
>> >> > translate into different
>> >> > tasks and duplicating the source datastreams to all the new generated
>> >> > tasks.
>> >> >
>> >> > Also since I don't know how the Flink compiler behaves, would some
of
>> >> the
>> >> > optimizations involve making changes to it too?
>> >> >
>> >> > Regards,
>> >> > Akshay Dixit
>> >> >
>> >> > [1] : Efficient Pattern Matching over Event Streams
>> >> > <
>> http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf
>> >> >
>> >> > [2] : On Supporting Kleene Closure over Event Streams
>> >> > <http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf>
>> >> > [3] : Processing Flows of Information: From Data Stream to Complex
>> Event
>> >> > Processing
>> >> > <
>> >>
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf
>> >> >
>> >> > [4] : Distributing Complex Event Detection
>> >> > <
>> >>
>> http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf>
>> >> >
>> >> > On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi <
>> >> balassi.marton@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Dear Akshay,
>> >> >>
>> >> >> Thanks again for your interest and for the recent contribution
to
>> >> >> streaming.
>> >> >>
>> >> >> Both of the projects mentioned wold be largely appreciated by the
>> >> >> community, and you can also propose other project suggestions here
>> for
>> >> >> discussion.
>> >> >>
>> >> >> Regarding FLINK-1534, the thesis I mentioned serves as a starting
>> point
>> >> >> and
>> >> >> indeed the basic solution can be implemented with filtering and
>> >> >> windowing/mapping with some state storing whether the cause of
an
>> event
>> >> >> has
>> >> >> been already seen. Solely relying on the now existing windowing
API
>> >> this
>> >> >> however might cause performance issues if the events also have
an
>> >> >> expiration timeout - some optimization there would be included.
The
>> >> >> further
>> >> >> challenge is to try to further exploit the parallel job execution
of
>> >> Flink
>> >> >> to possibly scale a pattern matching query.
>> >> >>
>> >> >> Best,
>> >> >>
>> >> >> Marton
>> >> >>
>> >> >> On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit <akshaydixi@gmail.com
>> >
>> >> >> wrote:
>> >> >>
>> >> >> > Hi,
>> >> >> > I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore,
India.
>> I'm
>> >> >> > currently interested in distributed systems and stream processing
>> >> and am
>> >> >> > looking to delve deeper into the subject, and hope to get
some
>> >> insight
>> >> >> by
>> >> >> > contributing to Apache Flink. I've gathered some idea of the
>> >> >> > flink-streaming codebase by recently working on a PR for
>> >> FLINK-1450[2].
>> >> >> >
>> >> >> > Both FLINK-1617[3] and FLINK-1534[4] are interesting projects
>> that I
>> >> >> would
>> >> >> > love to work on over the summer. I was wondering which amongst
>> these
>> >> >> would
>> >> >> > be more appreciated by the community, so I can start working
>> towards
>> >> a
>> >> >> > proposal for either one.
>> >> >> >
>> >> >> > Regarding FLINK-1534, I was wondering why would simply merging
and
>> >> >> > filtering the existing streams for events we want to detect
not
>> work?
>> >> >> Also
>> >> >> > on going through the document mentioned by @mbalassi in the
JIRA
>> >> >> > comment[5], the authors specify some Runtime Event Detection
>> >> concepts in
>> >> >> > Section 5.2. I'm assuming the project entails on building
a
>> similar
>> >> >> analogy
>> >> >> > using Flink and the deliverables would include working pattern
>> >> matching
>> >> >> > operators over Flink DataStreams as described in the report.
If
>> so,
>> >> then
>> >> >> > shouldn't it be trivial to implement the described the Binary
>> >> operator
>> >> >> > using a WindowedStream and a Filter?
>> >> >> > I hope my questions don't seem misplaced here and I would
>> appreciate
>> >> >> links
>> >> >> > to literature where I can learn more on the topic.
>> >> >> >
>> >> >> > Regards,
>> >> >> > Akshay Dixit
>> >> >> >
>> >> >> > [1] : http://akshaydixi.me
>> >> >> > [2] : https://github.com/apache/flink/pull/481
>> >> >> > [3] : https://issues.apache.org/jira/browse/FLINK-1617
>> >> >> > [4] : https://issues.apache.org/jira/browse/FLINK-1534
>> >> >> > [5] :
>> >> >> >
>> >>
>> http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf
>> >> >> >
>> >> >>
>> >> >
>> >> >
>> >>
>> >
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message