From: Theodore Vasiloudis
Date: Tue, 29 Mar 2016 11:11:40 +0200
Subject: Re: A whole bag of ML issues
To: dev@flink.apache.org

Hello Trevor,

These are indeed a lot of issues; let's see if we can fit the discussion
for all of them in one thread. I'll add some comments inline.

> - Expand SGD to allow for predicting vectors instead of just Doubles.

We have discussed this in the past, and at that point we decided that it
didn't make sense to change the base SGD implementation to accommodate
vectors. The alternatives presented at the time were to abstract away the
type of the input/output in the Optimizer (allowing for both Vectors and
Doubles), or to create specialized classes for each case. That also gives
us greater flexibility in terms of optimizing performance. (A rough
sketch of the abstraction idea follows at the end of these comments.)

In terms of the ANN, I think you can hide the Vectors away inside the
implementation of the ANN model and use the Optimizer interface as is,
like A. Ulanov did with the Spark ANN implementation.

> - Allow for 'warm starts'

I like the idea of having a partial_fit-like function; could you present
a couple of use cases where we might use it? I'm wondering if savepoints
already cover this functionality. (See the second sketch below for how
far the existing initial-weights hook already gets us.)

> - A library of model grading metrics.

We have a (perpetually) open PR for an evaluation framework. Could you
expand on "Having 'calculate RSquare' as a built-in method for every
regressor doesn't seem like an efficient way to do this long term"?

> - BLAS for matrix ops (this was talked about earlier)

This would be a good addition. If the operations are specific to the ANN
implementation, however, I would hide them away from the rest of the code
(and include them only in that PR) until another use case comes up.

> - A neural net has Arrays of matrices of weights (instead of just a
> vector).

Yes, this is probably not the most efficient way to do it, but I'm afraid
it's the least API-breaking one.

> - The linear regression implementation currently presumes it will be
> using SGD, but I think that should be 'settable' as a parameter.

The original Optimizer was written the way you describe, but IIRC we
changed it later to make it more accessible (e.g. for users who don't
know that you can't match L1 regularization with L-BFGS). Maybe Till can
say more about the other reasons this was changed.
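To make the first point concrete, here is a rough sketch of what
abstracting the output type in the Optimizer could look like. Note that
nothing below exists in FlinkML today; the trait and method names are
made up purely for illustration:

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.WeightVector
    import org.apache.flink.ml.math.Vector

    // Hypothetical type class: everything the solver needs to know
    // about an output type T, whether T is a Double (regression) or a
    // Vector (ANN output layer).
    trait OutputOps[T] extends Serializable {
      // Pointwise loss between a prediction and the true target.
      def loss(prediction: T, truth: T): Double
    }

    // Hypothetical generic solver: the SGD loop would be written once
    // against T and instantiated for Double- and Vector-valued targets.
    trait GenericSolver {
      def optimize[T: OutputOps](
          data: DataSet[(Vector, T)],
          initialWeights: Option[DataSet[WeightVector]]): DataSet[WeightVector]
    }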
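And on warm starts: IIRC the Solver interface already accepts optional
initial weights, so part of a partial_fit can be emulated today by
feeding the result of one fit back into the next, roughly as sketched
below (signatures quoted from memory, so please double-check against
master). What this does not fix is exactly the problem you point out:
the step size schedule restarts from iteration 0 on the second call.

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.{LabeledVector, WeightVector}
    import org.apache.flink.ml.optimization.IterativeSolver

    // Sketch: train on a first batch, then continue on a second batch
    // starting from the learned weights instead of from scratch.
    def warmStart(
        solver: IterativeSolver,
        batch1: DataSet[LabeledVector],
        batch2: DataSet[LabeledVector]): DataSet[WeightVector] = {
      val warm = solver.optimize(batch1, None)
      // The learning rate schedule still resets here; the iteration
      // counter you propose would address that.
      solver.optimize(batch2, Some(warm))
    }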
On Mon, Mar 28, 2016 at 8:01 PM, Trevor Grant wrote:

> Hey,
>
> I have a working prototype of a multilayer perceptron implementation
> working in Flink.
>
> I made every possible effort to utilize existing code when possible.
>
> In the process of doing this there were some hacks I want/need, and I
> think this should be broken up into multiple PRs and possibly
> abstracted out as a whole, because the MLP implementation I came up
> with is itself designed to be extendable to Long Short-Term Memory
> networks.
>
> At the top level, here are some of the sub-PRs:
>
> - Expand SGD to allow for predicting vectors instead of just Doubles.
> This allows the same NN code (and other algos) to be used for
> classification, transformations, and regressions.
>
> - Allow for 'warm starts' -> this requires adding a parameter to
> IterativeSolver that basically starts on iteration N. This is somewhat
> akin to the idea of partial fits in sklearn, OR making the iterative
> solver keep some sort of internal counter, so that when you call 'fit'
> it just runs another N iterations (where N is set by setIterations)
> instead of assuming it is back at zero. This might seem trivial, but
> it has a significant impact on step size calculations.
>
> - A library of model grading metrics. Having 'calculate RSquare' as a
> built-in method for every regressor doesn't seem like an efficient way
> to do this long term.
>
> - BLAS for matrix ops (this was talked about earlier).
>
> - A neural net has Arrays of matrices of weights (instead of just a
> vector). Currently I flatten the array of matrices out into a weight
> vector and reassemble it into an array of matrices, though this is
> probably not super efficient.
>
> - The linear regression implementation currently presumes it will be
> using SGD, but I think that should be 'settable' as a parameter,
> because if not, why do we have all of those other nice SGD methods
> just hanging out? Similarly, the loss function / partial loss is
> hard-coded. I recommend making the current setup the 'defaults' of a
> 'setOptimizer' method, i.e. if you just want to run an MLR you can do
> it based on the examples, but if you want to use a fancy optimizer you
> can create one from existing methods, or make your own, and then call
> something like `mlr.setOptimizer(myOptimizer)`.
>
> - and more
>
> At any rate, if some people could weigh in / direct me on how to
> proceed, that would be swell.
>
> Thanks!
> tg
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things." -Virgil*
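P.S. For concreteness, the flatten/reassemble step you describe could
look roughly like the following. This is only a sketch with made-up
helper names, not code from your prototype, and it assumes FlinkML's
DenseMatrix/DenseVector expose their backing Array[Double] as `data`:

    import org.apache.flink.ml.math.{DenseMatrix, DenseVector}

    // Flatten an array of layer weight matrices into a single weight
    // vector, so the existing vector-based solver API can be reused.
    def flatten(layers: Array[DenseMatrix]): DenseVector =
      DenseVector(layers.flatMap(_.data))

    // Reassemble the flat vector into matrices, given each layer's
    // (rows, cols) shape. Round-trips with flatten as long as the
    // element order stays consistent.
    def reassemble(flat: DenseVector,
                   shapes: Array[(Int, Int)]): Array[DenseMatrix] = {
      var offset = 0
      shapes.map { case (rows, cols) =>
        val size = rows * cols
        val m = DenseMatrix(rows, cols,
          flat.data.slice(offset, offset + size))
        offset += size
        m
      }
    }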