From: Till Rohrmann
Date: Tue, 21 Feb 2017 15:59:45 +0100
Subject: Re: [DISCUSS] Flink ML roadmap
To: dev@flink.apache.org

Thanks a lot for all your valuable input. It's great to see all your interest in Flink and its ML library :-)

1) Direction of FlinkML

In order to reboot the FlinkML library, we should indeed first decide on its direction and come up with a roadmap to get the community behind it. Since we only have limited resources, the first question for me is whether we continue developing a batch ML library or whether we concentrate on streaming machine learning.

The core idea of FlinkML was to provide the user with an easy toolbox for creating machine learning pipelines. These pipelines are not batch or streaming specific per se, but so far all our implementations are based on Flink's batch API.
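For illustration, a minimal pipeline of the kind FlinkML provides might look like the following sketch (the input path is a placeholder, and the parameter values are arbitrary):

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.MLUtils
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.preprocessing.StandardScaler
import org.apache.flink.ml.regression.MultipleLinearRegression

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // LibSVM is one of the formats FlinkML can read out of the box.
    val training: DataSet[LabeledVector] =
      MLUtils.readLibSVM(env, "/path/to/train.libsvm")

    // Chaining a Transformer with a Predictor yields a new Predictor whose
    // fit/predict calls run the whole pipeline.
    val pipeline = StandardScaler().chainPredictor(
      MultipleLinearRegression().setIterations(10).setStepsize(0.1))

    pipeline.fit(training)

    // Score the (unlabeled) training vectors just to show the predict side.
    val test: DataSet[Vector] = training.map(_.vector)
    pipeline.predict(test).print()
  }
}
```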
While implementing the ML algorithms, we realized that Flink's engine still has some deficiencies on the batch side. Theo already mentioned the iteration problem with static inputs [1] and the problem of caching intermediate results [2]. There are also other problems, such as dynamic memory management [3] and leg-wise scheduling [4] for complex topologies. Without these features, I don't see how Flink will be able to execute batch ML jobs efficiently. Unfortunately, all of these problems are far from trivial to solve and will require quite some changes to Flink's runtime.

Given Flink's current focus on stream processing, I don't see enough community capacity left to implement these features soon. Furthermore, if we decide to continue pursuing the batch direction, then we'll be in direct competition with more established frameworks such as SparkML, Weka, TensorFlow and scikit-learn. I suspect that the work to catch up with these libraries in terms of algorithm support alone will be quite challenging. Therefore, I think it would be more promising to concentrate on streaming ML and try to establish Flink's brand there. Streaming ML has not been explored as thoroughly as its batch counterpart, and there are not too many players in the field. Furthermore, it would be well aligned with the direction of the rest of the project.

1.1) Possible features

I agree with Theo that model serving/low-latency prediction would be a really good, almost natural, use case for Flink. For that we would need to be able to import trained models and make predictions with them. Maybe Clipper is a good solution for that, or maybe PMML or another model format; that is something we would have to research. Next, in order to support continuous model updates (maybe from a periodically triggered batch job), we would need side input support. With these two features we could probably already realize some really cool use cases.

2) Growing Flink's ML community

One of the problems with FlinkML, as you've mentioned, was the lack of active committer support after the initial development. As Gabor pointed out, if there is no committer around, then there is only little chance to become one, because nothing gets merged, even though we're in heavy need of committers. Since I'm the culprit in this case, I can tell you that it would be tremendously helpful if the community (in our case mostly contributors) kept actively reviewing each other's PRs. If a PR is in good shape, then it's much easier (less work) to merge it. I think this could be an immediate action point. Next, I started a discussion thread [5] about restructuring Flink in order to decrease test and build times, but also to make it easier to add new committers for modules where we have a high need. Maybe this can help solve the committer problem.

3) Showcasing capabilities

I agree with Timur's observation that we have far too little material out there showcasing what is actually possible to do with Flink wrt ML. That is something we can start changing right away. One good option is always to write a blog post about an interesting use case you've implemented, so I like Katherin's idea very much. Indeed, when I implemented the ALS matrix factorization with Flink, we came across a lot of these problems. The other good option that was mentioned is the creation of a kind of ML cookbook. The cookbook could contain advanced recipes for how to solve certain problems with FlinkML. The Flink community has always wanted to create such a cookbook for Flink in general; maybe we could lay the first foundation for it.

[1] https://issues.apache.org/jira/browse/FLINK-2396
[2] https://issues.apache.org/jira/browse/FLINK-1404
[3] https://issues.apache.org/jira/browse/FLINK-1101
[4] https://issues.apache.org/jira/browse/FLINK-2119
[5] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Project-build-time-and-possible-restructuring-tt16088.html

Cheers,
Till
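To make the model-serving idea concrete, here is a hypothetical sketch of scoring a pre-trained model inside a streaming job. `Model` and `ModelLoader` are invented placeholders (a PMML evaluator or similar could sit behind them); only the Flink API calls are real:

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

trait Model extends Serializable {
  def predict(features: Array[Double]): Double
}

object ModelLoader {
  // Placeholder: a real implementation would parse a trained model
  // (PMML, a custom format, ...) from the given path.
  def load(path: String): Model = new Model {
    def predict(features: Array[Double]): Double = features.sum // dummy
  }
}

class ScoringFunction(modelPath: String)
    extends RichMapFunction[Array[Double], Double] {

  @transient private var model: Model = _

  override def open(parameters: Configuration): Unit = {
    // Load the model once per parallel task, not once per record.
    model = ModelLoader.load(modelPath)
  }

  override def map(features: Array[Double]): Double =
    model.predict(features)
}

object ServingJobSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val features: DataStream[Array[Double]] = env
      .socketTextStream("localhost", 9999)       // placeholder source
      .map(_.split(",").map(_.toDouble))
    features.map(new ScoringFunction("/path/to/model.pmml")).print()
    env.execute("model serving sketch")
  }
}
```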
On Tue, Feb 21, 2017 at 12:04 PM, Theodore Vasiloudis <theodoros.vasiloudis@gmail.com> wrote:

> Thank you all for your thoughts on the matter.
>
> Andrea brought up some further engine considerations that we need to address in order to have a competitive ML engine on Flink.
>
> I'm happy to see many people willing to contribute to the development of ML on Flink. The way I see it, there needs to be buy-in from the rest of the community for such changes to go through.
>
> If you are interested in helping out, the most critical issues to tackle are the ones mentioned in my previous email and the ones mentioned by Andrea, as they require making changes to the core.
>
> If you want to take up one of those issues, the best way is to start a conversation on the list and gauge the opinion of the community.
>
> Finally, as Stavros mentioned, we need to come up with an updated roadmap for FlinkML that includes these issues.
>
> @Andrea, the idea of an online learning library for Flink has been broached before, and this semester I have one Master's student working on exactly that. From my conversations with people in the industry, however, almost nobody uses online learning in production; at best, models are updated every 5 minutes. So the impact would probably not be very large.
>
> I would like to bring up again the topic of model serving, which I think fits the Flink use case much better. Developing a system like Clipper [1] on top of Flink could be one of the best ways to use Flink for ML.
>
> Regards,
> Theodore
>
> [1] Clipper: A Low-Latency Online Prediction Serving System - https://arxiv.org/abs/1612.03079
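For reference, the kind of online learning discussed above could be sketched as follows: a per-key linear model updated with SGD as events arrive, kept in Flink's keyed state via the Scala `mapWithState` shortcut. This is illustrative only; the `Sample` type, input format and learning rate are all invented for the sketch:

```scala
import org.apache.flink.streaming.api.scala._

case class Sample(key: String, features: Array[Double], label: Double)

object OnlineSgdSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Expected input lines: key,label,f1,f2,...  (placeholder source/format)
    val samples: DataStream[Sample] = env
      .socketTextStream("localhost", 9999)
      .map { line =>
        val parts = line.split(",")
        Sample(parts(0), parts.drop(2).map(_.toDouble), parts(1).toDouble)
      }

    val learningRate = 0.01

    // Emit (key, prediction) and fold the SGD weight update into the state.
    val predictions: DataStream[(String, Double)] = samples
      .keyBy(_.key)
      .mapWithState[(String, Double), Array[Double]] { (s, state) =>
        val w = state.getOrElse(Array.fill(s.features.length)(0.0))
        val prediction = w.zip(s.features).map { case (wi, xi) => wi * xi }.sum
        val error = prediction - s.label
        val updated =
          w.zip(s.features).map { case (wi, xi) => wi - learningRate * error * xi }
        ((s.key, prediction), Some(updated))
      }

    predictions.print()
    env.execute("online SGD sketch")
  }
}
```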
> On Tue, Feb 21, 2017 at 12:10 AM, Andrea Spina wrote:
>
> > Hi all,
> >
> > Thanks Stavros for pushing the discussion forward; I feel it is really relevant.
> >
> > Since I'm only now becoming active in the community and I don't yet have enough experience and visibility around the Flink community, I'll limit myself to sharing an opinion as a Flink user.
> >
> > I've been using Flink for almost a year across two different experiences, and in both cases I've bumped into the question "how to handle ML workloads and keep Flink as the main engine?" The first point that comes to my mind: why do I need to adopt an extra system purely for ML purposes? How amazing would it be to use the Flink engine as an ML feature provider and avoid paying the effort of maintaining an additional engine? This thought also links to @Timur's opinion: I believe users would much prefer a unified architecture in this case. Even if a user wants to use an external tool/library - perhaps one providing additional language support (e.g. R) - that user should be able to run it on top of Flink.
> >
> > In my work with Flink I needed to implement some ML algorithms on both Flink and Spark, and I often struggled with Flink's performance. In the name of the bigger picture, I think we should first focus our effort on solving some well-known Flink limitations, as @Theodore pinpointed. I'd like to highlight [1] and [2], which I find relevant. If the community decides to go ahead with FlinkML, I believe fixing the issues described above would be a good starting point. That would also definitely push forward some important integrations, such as Apache SystemML.
> >
> > Given all these points, I'm increasingly convinced that online machine learning would be the real final objective and the more suitable goal, since we're talking about a real-time streaming engine, and - from a high-level point of view - I believe Flink would fit this topic in a more genuine way than the batch case. We have a connector for Apache SAMOA, but IMHO it seems to be at an early stage of development and not really active. If we want to build something within Flink instead, we need to speed up the design of some features (e.g. side inputs [3]).
> >
> > I really hope we can define a new roadmap through which we can finally push this topic forward. I will do my best to help in this way.
> >
> > Sincerely,
> > Andrea
> >
> > [1] Add a FlinkTools.persist style method to the Data Set
> > https://issues.apache.org/jira/browse/FLINK-1730
> > [2] Only send data to each taskmanager once for broadcasts
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
> > [3] Side inputs - Evolving or static Filter/Enriching
> > https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-MKQYN3m4/edit#
> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Add-Side-Input-Broadcast-Set-For-Streaming-API-td11529.html
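While side inputs [3] are still being designed, the continuous-model-update pattern discussed in this thread is commonly approximated by broadcasting model updates and connecting them with the event stream. A minimal sketch under that assumption follows; `Event` and `ModelUpdate` are invented types, and only the Flink operators are real:

```scala
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Event(features: Array[Double])
case class ModelUpdate(weights: Array[Double])

class ApplyLatestModel extends CoFlatMapFunction[Event, ModelUpdate, Double] {
  // Plain field: every parallel task sees every update because the model
  // stream is broadcast. A production job would checkpoint this as state.
  private var weights: Option[Array[Double]] = None

  override def flatMap1(event: Event, out: Collector[Double]): Unit =
    // Score with the latest model; drop events until a model has arrived.
    weights.foreach { w =>
      out.collect(w.zip(event.features).map { case (wi, xi) => wi * xi }.sum)
    }

  override def flatMap2(update: ModelUpdate, out: Collector[Double]): Unit =
    weights = Some(update.weights)
}

object ContinuousModelSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Placeholder sources; real jobs would read from Kafka, a periodically
    // refreshed model store, etc.
    val events: DataStream[Event] = env.fromElements(Event(Array(1.0, 2.0)))
    val updates: DataStream[ModelUpdate] =
      env.fromElements(ModelUpdate(Array(0.5, 0.5)))

    events
      .connect(updates.broadcast)
      .flatMap(new ApplyLatestModel)
      .print()

    env.execute("continuous model update sketch")
  }
}
```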