flink-dev mailing list archives

From Dmitriy Setrakyan <dsetrak...@apache.org>
Subject Re: [DISCUSS] Flink and Ignite integration
Date Fri, 01 May 2015 16:42:58 GMT
Hi Stephan,

Your suggestions are very interesting. I think we should pick a couple of
paths we can tackle with minimal effort and start there. I am happy to help
in getting this effort started (maybe we should have a Skype discussion?)

My comments are below...

D.

On Wed, Apr 29, 2015 at 3:35 AM, Stephan Ewen <sewen@apache.org> wrote:

> Hi everyone!
>
> First of all, hello to the Ignite community and happy to hear that you are
> interested in collaborating!
>
> Building on what Fabian wrote, here is a list of efforts that we ourselves
> have started, or that would be useful.
>
> Let us know what you think!
>
> Stephan
>
>
>
> -------------------------------------------------------------------------------------------------------
> Ignite as a FileSystem
>
> -------------------------------------------------------------------------------------------------------
>
> That should be the simplest addition. Flink integrates the FileSystem
> classes from Hadoop. If there is an Ignite version of that FileSystem
> class, you should
> be able to register it in a Hadoop config, point to the config in the Flink
> config and it should work out of the box.
>
> If Ignite does not yet have that FileSystem, it is easy to implement a
> Flink Filesystem.
>


Ignite implements the Hadoop FileSystem API. More info here:
http://apacheignite.readme.io/v1.0/docs/file-system
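
For reference, the registration Stephan describes would look roughly like
this. The property name and implementation class below are taken from the
Ignite documentation linked above; the paths and endpoint are placeholders:

```xml
<!-- core-site.xml: map the igfs:// scheme to Ignite's Hadoop FileSystem -->
<configuration>
  <property>
    <name>fs.igfs.impl</name>
    <value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
  </property>
</configuration>
```

```yaml
# flink-conf.yaml: point Flink at the Hadoop configuration directory above
fs.hdfs.hadoopconf: /path/to/hadoop/conf
```

With both in place, igfs:// paths should work in Flink sources and sinks out
of the box.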



>
>
>
> -------------------------------------------------------------------------------------------------------
> Ignite as a parameter server
>
> -------------------------------------------------------------------------------------------------------
>
> This is one approach that a contributor has started with. The goal is to
> store a large set of model parameters in a distributed fashion,
> such that all Flink TaskManagers can access them and update them
> (asynchronously).
>
> The core requirements here are:
>  - Fast put performance. Often, no consistency is needed, put operations
> may simply overwrite each other, some of them can even be tolerated to get
> lost
>  - Fast get performance, heavy local caching.
>


Sounds like a very straightforward integration with IgniteCache:
http://apacheignite.readme.io/v1.0/docs/jcache
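
To make the two requirements concrete, here is a minimal stdlib-only sketch
of the access pattern (this is not Ignite's API; with IgniteCache, the put
and get below would map onto cache operations):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Toy parameter server illustrating the access pattern described above:
 *  fire-and-forget puts (last write wins, lost updates tolerated) and
 *  fast local gets. Not Ignite's API; names are illustrative. */
class ParamServerSketch {
    private final ConcurrentHashMap<String, double[]> store = new ConcurrentHashMap<>();
    private final ExecutorService io = Executors.newSingleThreadExecutor();

    /** Asynchronous put: callers do not wait; concurrent puts simply overwrite. */
    public CompletableFuture<Void> putAsync(String key, double[] value) {
        return CompletableFuture.runAsync(() -> store.put(key, value), io);
    }

    /** Get served from local memory; a real setup would add near-caching. */
    public double[] get(String key) {
        return store.get(key);
    }

    public void shutdown() { io.shutdown(); }
}
```

The "no consistency needed" requirement is what makes this easy: overwriting
puts need no locking or versioning, only a visible latest value.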



>
>
> -------------------------------------------------------------------------------------------------------
> Ignite as distributed backup for Streaming Operator State
>
> -------------------------------------------------------------------------------------------------------
>
> Flink periodically checkpoints the state of streaming operators. We are
> looking to have different backends to
> store the state to, and Ignite could be one of them.
>
> This would write periodically (say every 5 seconds) a chunk of binary data
> (possibly 100s of MB on every node) into Ignite.
>


Again, I think you can utilize either partitioned or replicated caches from
Ignite here.
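
One practical wrinkle with cache-based checkpoints is entry size. A hedged,
stdlib-only sketch of one way to handle it: split each snapshot into
fixed-size chunks keyed by (checkpointId, chunkIndex), so a blob of 100s of
MB spreads across cache partitions instead of landing in one entry. The key
format and chunk size are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

/** Splits a large state snapshot into fixed-size chunks, each stored under
 *  a "checkpointId/chunkIndex" key. Illustrative only. */
class ChunkedCheckpoint {
    static Map<String, byte[]> toChunks(long checkpointId, byte[] state, int chunkSize) {
        Map<String, byte[]> chunks = new HashMap<>();
        for (int i = 0; i * chunkSize < state.length; i++) {
            int from = i * chunkSize;
            int to = Math.min(from + chunkSize, state.length);
            byte[] part = new byte[to - from];
            System.arraycopy(state, from, part, 0, part.length);
            chunks.put(checkpointId + "/" + i, part);
        }
        return chunks;
    }
}
```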



>
>
>
> -------------------------------------------------------------------------------------------------------
> Ignite as primary store for Streaming Operator State
>
> -------------------------------------------------------------------------------------------------------
>
> If we want to directly store the state of streaming computation in Ignite
> (rather than storing it in Flink and backing it
> up to Ignite), we have the following requirements:
>
>   - Massive put and get performance, up to millions per second per machine.
>   - No synchronous replication needed, replication can happen
> asynchronously in the background
>   - At certain points, Flink will request to get a signal once everything
> is properly replicated
>


This sounds like a good use case for Ignite Streaming, which continuously
loads large amounts of streamed data into Ignite caches. Ignite has an
abstraction called "IgniteDataStreamer" that would satisfy your
requirements. It does everything asynchronously and can provide
notifications if needed. More info here:
http://apacheignite.readme.io/v1.0/docs/data-streamers
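
To illustrate the contract being described (non-blocking adds, background
batching, a flush that signals completion), here is a stdlib-only sketch.
This is not IgniteDataStreamer's API, just the shape of it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Callers add entries without blocking; batches are written in the
 *  background; flush() returns a future completed once everything added
 *  so far is stored. Names and batch size are illustrative. */
class StreamerSketch<K, V> {
    private final Map<K, V> backingStore = new ConcurrentHashMap<>();
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final List<Map.Entry<K, V>> buffer = new ArrayList<>();
    private final int batchSize;

    StreamerSketch(int batchSize) { this.batchSize = batchSize; }

    /** Non-blocking add: buffers locally, writes a batch once full. */
    public synchronized void add(K key, V value) {
        buffer.add(Map.entry(key, value));
        if (buffer.size() >= batchSize) flushBuffer();
    }

    private synchronized CompletableFuture<Void> flushBuffer() {
        List<Map.Entry<K, V>> batch = new ArrayList<>(buffer);
        buffer.clear();
        // Single-threaded writer: this future completes only after all
        // previously queued batches have also been stored.
        return CompletableFuture.runAsync(
            () -> batch.forEach(e -> backingStore.put(e.getKey(), e.getValue())), writer);
    }

    /** Completion notification: resolves when all pending entries are stored. */
    public CompletableFuture<Void> flush() { return flushBuffer(); }

    public V get(K key) { return backingStore.get(key); }
}
```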



>
>
> -------------------------------------------------------------------------------------------------------
> Ignite as distributed backup for intermediate results
>
> -------------------------------------------------------------------------------------------------------
>
> Flink may cache intermediate results for recovering or resuming computation
> at a certain point in the program. This would be similar to backing up
> streaming state. Once in a while, a giant put
> operation with GBs of binary data.
>
>

I think Ignite File System (IGFS) would be a perfect candidate for it. If
this approach does not work, then you can think about using Ignite caches
directly, but it may get a bit tricky if you plan to store objects as
large as 1GB each.



>
>
>
> -------------------------------------------------------------------------------------------------------
> Run Flink Batch Programs on Ignite's compute fabric.
>
> -------------------------------------------------------------------------------------------------------
>
> I think this would be interesting, and we can make this such that programs
> are binary compatible.
> Flink currently has multiple execution backends already: Flink local, Flink
> distributed, Tez, Java Collections.
> It is designed to be layered and pluggable.
>
> You as a programmer define the desired execution backend by choosing the
> corresponding ExecutionEnvironment,
> such as "ExecutionEnvironment.createLocalEnvironment()", or
> "ExecutionEnvironment.createCollectionsEnvironment()".
> If you look at the "execute()" methods, they take the Flink program and
> prepare it for execution in the corresponding backend.
>


Hm... this sounds *very* interesting. If I understand correctly, you are
suggesting that Ignite becomes one of the Flink backends, right? Is there a
basic example online or in the product for it, so I can gauge what it would
take?
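
To sketch what "pluggable backend" means here, independent of Flink's actual
classes (all names below are illustrative, not real Flink or Ignite API):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** The program is written against an abstract backend; execute-time
 *  dispatch picks the engine. Reduced here to a single map operation. */
interface Backend {
    <T, R> List<R> run(List<T> input, Function<T, R> mapFn);
}

/** "Collections"-style backend: runs the program on local Java collections,
 *  analogous to Flink's collections execution environment. */
class CollectionsBackend implements Backend {
    public <T, R> List<R> run(List<T> input, Function<T, R> mapFn) {
        return input.stream().map(mapFn).collect(Collectors.toList());
    }
}
```

A hypothetical Ignite-backed environment would implement the same interface
and ship the function to the cluster instead of running it locally; the
program itself would not change.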



>
>
>
>
> -------------------------------------------------------------------------------------------------------
> Run Flink Streaming Programs on Ignite's compute fabric.
>
> -------------------------------------------------------------------------------------------------------
>
> The execution mechanism for streaming programs is changing fast right now.
> I would postpone this for a few
> weeks until we have converged there.
>
>
Sounds good.


>
>
>
>
>
>
> On Wed, Apr 29, 2015 at 1:28 AM, Dmitriy Setrakyan <dsetrakyan@apache.org>
> wrote:
>
> > On Tue, Apr 28, 2015 at 5:55 PM, Fabian Hueske <fhueske@gmail.com>
> wrote:
> >
> > > Thanks Cos for starting this discussion, hi to the Ignite community!
> > >
> > > Probably the easiest and most straightforward integration of Flink and
> > > Ignite would be to go through Ignite's IGFS. Flink can be easily
> extended
> > > to support additional filesystems.
> > >
> > > However, the Flink community is currently also looking for a solution
> to
> > > checkpoint operator state of running stream processing programs. Flink
> > > processes data streams in real time similar to Storm, i.e., it
> schedules
> > > all operators of a streaming program and data is continuously flowing
> > from
> > > operator to operator. Instead of acknowledging each individual record,
> > > Flink injects stream offset markers into the stream in regular
> intervals.
> > > Whenever an operator receives such a marker, it checkpoints its current
> > > state (currently to the master with some limitations). In case of a
> > > failure, the stream is replayed (using a replayable source such as
> Kafka)
> > > from the last checkpoint that was not received by all sink operators
> and
> > > all operator states are reset to that checkpoint.
> > > We had already looked at Ignite and were wondering whether Ignite could
> > be
> > > used to reliably persist the state of streaming operators.
> > >
> >
> > Fabian, do you need these checkpoints stored in memory (with optional
> > redundant copies, of course) or on disk? I think in-memory makes a lot
> > more sense from a performance standpoint, and can easily be done in Ignite.
> >
> >
> > >
> > > The other points I mentioned on Twitter are just rough ideas at the
> > moment.
> > >
> > > Cheers, Fabian
> > >
> > > 2015-04-29 0:23 GMT+02:00 Dmitriy Setrakyan <dsetrakyan@apache.org>:
> > >
> > > > Thanks Cos.
> > > >
> > > > Hello Flink Community.
> > > >
> > > > From Ignite standpoint we definitely would be interested in providing
> > > Flink
> > > > processing API on top of Ignite Data Grid or IGFS. It would be
> > > interesting
> > > > to hear what steps would be required for such integration or if there
> > are
> > > > other integration points.
> > > >
> > > > D.
> > > >
> > > > On Tue, Apr 28, 2015 at 2:57 PM, Konstantin Boudnik <cos@apache.org>
> > > > wrote:
> > > >
> > > > > Following the lively exchange in Twitter (sic!) I would like to
> bring
> > > > > together
> > > > > Ignite and Flink communities to discuss the benefits of the
> > integration
> > > > and
> > > > > see where we can start it.
> > > > >
> > > > > We have this recently opened ticket
> > > > >   https://issues.apache.org/jira/browse/IGNITE-813
> > > > >
> > > > > and Fabian has listed the following points:
> > > > >
> > > > >  1) data store
> > > > >  2) parameter server for ML models
> > > > >  3) Checkpointing streaming op state
> > > > >  4) continuously updating views from streams
> > > > >
> > > > > I'd add
> > > > >  5) using Ignite IGFS to speed up Flink's access to HDFS data.
> > > > >
> > > > > I see a lot of interesting correlations between two projects and
> > wonder
> > > > if
> > > > > Flink guys can step up with a few thoughts on where Flink can
> benefit
> > > the
> > > > > most
> > > > > from Ignite's in-memory fabric architecture? Perhaps, it can be
> used
> > as
> > > > > in-memory storage where the other components of the stack can
> quickly
> > > > > access
> > > > > and work w/ the data w/o a need to dump it back to slow storage?
> > > > >
> > > > > Thoughts?
> > > > >   Cos
> > > > >
> > > >
> > >
> >
>
