Date: Sun, 14 Feb 2016 12:53:24 +0100
Subject: Re: Opening a discussion on FlinkML
From: Martin Neumann
To: dev@flink.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a1135e45634fdf4052bb98c9b

--001a1135e45634fdf4052bb98c9b
Content-Type: text/plain; charset=UTF-8

I think the focus of this discussion should be how we proceed not what to do.
The what comes from the committers anyway. There are several people who
would like to commit, including people from the Streamline project. Having
pull requests that are older than 6 months is not good for any project. The
main question is how we can develop the library further with high standards
but without creating a bottleneck that holds things back too much.

In my opinion it would be best if we find enough resources to keep things
inside Flink. However, if we have to depend on people who are already
stretched for time, splitting it out might be the better option (path 1
from Theo's original mail).

cheers
Martin

On Fri, Feb 12, 2016 at 3:54 PM, Suneel Marthi wrote:

> On Fri, Feb 12, 2016 at 9:40 AM, Simone Robutti <
> simone.robutti@radicalbit.io> wrote:
>
> > @Suneel
> >
> > 1) Totally agree, as I wrote before.
> >
> > 2) I agree that support for PMML is premature but we shouldn't
> > underestimate the variety and complexity of the uses of ML models in
> > the industry. The adoption of Flink, hopefully, will grow and reach
> > less innovative realities where Random Forests and SVMs are still the
> > main algorithms in use. In these same realities there are legacies
> > that justify the use of PMML to port models. Still, FlinkML is in an
> > early stage so, as you said, it doesn't make sense to spend time right
> > now on such a feature.
> >
>
> +1, as I mentioned earlier the PMML spec only supports classification and
> clustering (I last checked this in Aug 2015, pretty sure it would not have
> changed since then); hence 'Yes' it has some limited uses; 'No' - it's too
> premature to even talk about it given the present state of FlinkML.
>
> >
> > 3) This would be really interesting. How do you imagine that the
> > integration with a distributed processing engine would work?
> >
>
> I am not sure yet, we are still exploring this on the Mahout project to
> add to Mahout-Samsara - most of the statistics and probabilistic modeling
> would then be supported by Figaro (Bayesian, MCMC etc) and hence can be
> external to FlinkML.
>
> Figaro is Scala based. See https://github.com/p2t2/figaro
>
> I believe there are a few other similar DSLs out there, need to dig up my
> old emails.
>
> (Not sure if it's ASLv2 License, need verification here)
>
> >
> > 5) Agree on this one too. To my knowledge it would be the best option
> > together with SAMOA (for the streaming part).
> >
>
> There's already Flink - Samoa integration in place IIRC.
>
> >
> > 2016-02-12 15:25 GMT+01:00 Suneel Marthi :
> >
> > > My 2 cents as someone who's done ML over the years - having worked on
> > > Oryx 2.0 and Mahout and having used Spark MLlib (read as "had no
> > > choice due to strict workplace enforcement") and understands well
> > > their limitations.
> > >
> > > 1. FlinkML in its present form seems like "do it like how Spark did
> > > it".
> > >
> > > 2. The recent discussion about PMML support in Flink to my mind is a
> > > clear example of putting the cart before the horse. Why are we even
> > > talking PMML when there ain't much ML algos in FlinkML?
> > >
> > > For a real good implementation of PMML and how it's being used (with
> > > jPMML), suggest look at the Oryx 2.0 project. The PMML implementation
> > > in Oryx 2.0 predates Spark and is a clean example of separating PMML
> > > from the underlying framework (Spark or Flink).
> > >
> > > We have had PMML discussions on the Mahout project in the past, but
> > > the idea never gained any traction in large part due to PMML spec
> > > limitations (mostly for clustering and classification algorithms) and
> > > the lack of adoption within the community.
> > >
> > > See the discussion here and specifically Ted Dunning's comment on
> > > PMML -
> > >
> > > http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E
> > >
> > > Most of the ML in practice (deployed in production) today are
> > > Recommenders and Deep Learning - both of which are not supported by
> > > the PMML spec.
> > >
> > > 3. Leveraging a probabilistic programming language like Figaro might
> > > be a good way to go (just my thought) - that way most of the ML
> > > groundwork would be external to Flink.
> > >
> > > 4. Within the Mahout community, we had been talking (and are working)
> > > on redoing the Samsara Distributed linear algebra framework to support
> > > Flink (in large part we realized that Flink is a better platform than
> > > the more popular one out there that Slim wouldn't wanna talk about :) ).
> > >
> > > We should be having a release out in the next few weeks (depending on
> > > committers' availability). It would be great if FlinkML had something
> > > like it.
> > >
> > > There was a good audience to Sebastian's talk on this subject at #FF15
> > > in October.
> > >
> > > 5. It's a good idea to add Flink support to H2O as Slim had suggested
> > > elsewhere in this thread.
> > >
> > > Thoughts?
> > >
> > > On Fri, Feb 12, 2016 at 5:00 AM, Simone Robutti <
> > > simone.robutti@radicalbit.io> wrote:
> > >
> > > > I will say my opinion as a person that has worked with SparkML and
> > > > will be involved soon in the development of ML solutions on Flink.
> > > >
> > > > In these days I tried to track the evolution and development of
> > > > FlinkML and I see a big critical point: FlinkML looks a lot like a
> > > > placeholder for commercial purposes but there's not enough
> > > > investment and commitment to achieve a usable product.
> > > > I did a few things with FlinkML coming from SparkML and I can say
> > > > that it's unsuitable for most of the common use cases covered by
> > > > SparkML (that is not a good ML library at all in terms of
> > > > usability).
> > > >
> > > > So my question is: do we really need FlinkML? The roadmap looks a
> > > > lot like "Spark has SparkML so we MUST have an ML library too". This
> > > > could be reasonable if you aim at a fine-tuned library tailored to
> > > > the specifics of Flink that are different from Spark. This could be
> > > > even better if you developed an implementation of SGD that exploits
> > > > the computational model of Flink that, I think, could achieve a lot
> > > > more compared to the current implementation. This is a subject that
> > > > I want to study better before saying more but I'm looking at better
> > > > parallelization strategies for data and models.
> > > >
> > > > Going back to FlinkML, do we really need to reimplement the same
> > > > workhorse algorithms already implemented in SparkML, H2O, Mahout,
> > > > SystemML, Weka, Oryx and other distributed learning libraries? Is it
> > > > really useful at this stage? Given the current resources of the
> > > > project, wouldn't it be more reasonable to invest time and energy in
> > > > integrating more mature libraries (and eventually rich tooling that
> > > > would give a big advantage over the other libraries)?
> > > >
> > > > I would like to comment on your proposals but my experience in
> > > > collaborative open source development is way too limited to form an
> > > > interesting opinion. Also I had no historical visibility on the
> > > > motivations and discussions behind the development of FlinkML and I
> > > > would like pointers to read something on what is the shared vision
> > > > on this part of the project so that I could join the discussion from
> > > > now on.
> > > >
> > > > Thanks,
> > > >
> > > > Simone
> > > >
> > > > 2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <
> > > > theodoros.vasiloudis@gmail.com>:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I would like to get a conversation started on how we plan to move
> > > > > forward with FlinkML.
> > > > >
> > > > > Development on the library has been mostly dormant for the past 6
> > > > > months, mainly I believe because of the lack of available
> > > > > committers to review PRs.
> > > > >
> > > > > Last month we got together with Till and Marton and talked about
> > > > > how we could try to solve this and ensure continued development of
> > > > > the library.
> > > > >
> > > > > We see 3 possible paths we could take:
> > > > >
> > > > > 1. Externalize the library, creating a new repository under the
> > > > >    Apache Flink project. This decouples the development of FlinkML
> > > > >    from the Flink release cycle, allowing us to move faster and
> > > > >    incorporate new features as they become available. As FlinkML
> > > > >    is a library under development, tying it to specific versions
> > > > >    does not make much sense anyway. The library would depend on
> > > > >    the latest snapshot version of Flink. It would then be possible
> > > > >    for the Flink distribution to cherry-pick parts of the library
> > > > >    to be included with the core distribution.
> > > > > 2. Keep the development under the main Flink project but bring in
> > > > >    new committers. This would mean that the development remains as
> > > > >    is and is tied to core Flink releases, but new work should get
> > > > >    merged at much more regular intervals through the help of
> > > > >    committers other than Till.
> > > > >    Marton Balassi has volunteered for that role and I hope that
> > > > >    more might take up that role.
> > > > > 3. A third option is to fork FlinkML on a repository on which we
> > > > >    are able to commit freely (again through PRs and reviews of
> > > > >    course) and merge good parts back into the main repo once in a
> > > > >    while. This allows for faster progress and more experimental
> > > > >    work but obviously creates fragmentation.
> > > > >
> > > > > I would like to hear your thoughts on these three options, as well
> > > > > as discuss other alternatives that could help move FlinkML forward.
> > > > >
> > > > > Cheers,
> > > > > Theodore
> > > > >

--001a1135e45634fdf4052bb98c9b--