Subject: Re: Some feedback on the Gradient Descent Code
From: Mikio Braun
To: Till Rohrmann
Cc: dev@flink.apache.org, Theodore Vasiloudis
Date: Thu, 28 May 2015 17:14:39 +0200

GradientDescent is just the (batch-)SGD optimizer, right?

Actually I think the parameter update should be done by a
RegularizationFunction. IMHO the structure should be like this:

GradientDescent
- collects gradient and regularization updates from
- CostFunction

LinearModelCostFunction
- inherits from CostFunction
- parameterized by LossFunction and RegularizationFunction

Does that make sense?

On Thu, May 28, 2015 at 5:00 PM, Till Rohrmann wrote:
> Hey Mikio,
>
> yes, you're right. The SGD only needs to know the gradient of the loss
> function and some means to update the weights in accordance with the
> regularization scheme. Additionally, we also need to be able to compute
> the loss for the convergence criterion.
>
> That's also how it is implemented in the before-mentioned PR right now.
> The question which arises now is whether the LossFunction should be
> responsible for updating the weight vector or not. So what we could do
> is the following: remove the regularization from the LossFunction and
> let the GradientDescent solver do the parameter update. This either
> requires specialized implementations such as GradientDescentL1 and
> GradientDescentL2, or a configuration parameter which defines a
> regularization function which is then called for the parameter update:
>
> trait RegularizationFunction {
>   def parameterUpdate(weightVector: WeightVector, gradient: WeightVector,
>     learningRate: Double, regularizationConstant: Double): WeightVector
>
>   def regularizationValue(weightVector: WeightVector): Double
> }
>
> Both approaches are semantically equivalent. I have no strong preference
> for either of them. What do you think is the better approach?
>
>
> On Thu, May 28, 2015 at 4:13 PM, Mikio Braun wrote:
>>
>> [Ok, so maybe this is exactly what is implemented, sorry if I'm just
>> repeating you...]
>>
>> So
>>
>> C(w, xys) = regularization(w) + sum over xys of losses
>>
>> The gradient is
>>
>> grad regularization(w) + sum grad losses(w, xy)
>>
>> For some regularization functions, regularization is better performed
>> by some explicit operation on the weights, like subtracting some value
>> if it is still too large (I forgot how exactly that works).
>>
>> The losses consist of the model (for now, linear I guess), plus the
>> kind of loss (hinge, squared, and maybe logistic for LR).
>>
>> So from the viewpoint of SGD, we only need a way to compute the
>> gradient for a set of examples, and maybe some way to update (e.g.
>> shrink) the weights for regularization.
>>
>> So SGD takes some
>>
>> trait CostFunction {
>>   def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector
>>
>>   def parameterUpdate(weightVector: Vector): Vector
>> }
>>
>> so that SGD can run.
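[To make the RegularizationFunction interface proposed above concrete, here is a minimal plain-Scala sketch. It is not the actual PR code: WeightVector is simplified to Array[Double], and the L2 update rule w' = w - eta * (g + lambda * w) is one hypothetical implementation of parameterUpdate.]

```scala
// Minimal sketch of the RegularizationFunction idea discussed above.
// WeightVector is simplified to Array[Double]; names are illustrative,
// not the actual Flink ML API.
trait RegularizationFunction {
  def parameterUpdate(weights: Array[Double], gradient: Array[Double],
                      learningRate: Double,
                      regularizationConstant: Double): Array[Double]

  def regularizationValue(weights: Array[Double]): Double
}

// L2 regularization: adds lambda * w to the gradient before the step.
// regularizationValue returns 0.5 * ||w||^2; the regularization
// constant is applied by the caller when computing the total loss.
object L2Regularization extends RegularizationFunction {
  def parameterUpdate(weights: Array[Double], gradient: Array[Double],
                      learningRate: Double,
                      regularizationConstant: Double): Array[Double] =
    weights.zip(gradient).map { case (w, g) =>
      w - learningRate * (g + regularizationConstant * w)
    }

  def regularizationValue(weights: Array[Double]): Double =
    0.5 * weights.map(w => w * w).sum
}
```

With this shape, GradientDescentL1 and GradientDescentL2 collapse into a single solver that is handed the regularization function as a parameter.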
>>
>> On the other hand, we could have a
>>
>> class LinearModelCostFunction {
>>   val lossFunction: LossFunction
>>   val regularizationFunction: RegularizationFunction
>>
>>   val regularizationConstant
>>
>>   def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector = {
>>     regularizationConstant * regularizationFunction.gradient(weightVector) +
>>       xys.map(xy => lossFunction.gradient(xy, weightVector)).sum()
>>   }
>>
>>   def parameterUpdate = regularizationFunction.parameterUpdate
>> }
>>
>> That way the optimizer and the cost function would be entirely
>> decoupled, and you could use the SGD to train practically anything
>> which has a data-dependent gradient.
>>
>> What do you think?
>>
>> -M
>>
>> On Thu, May 28, 2015 at 4:03 PM, Mikio Braun wrote:
>> > Oh wait... continuing to type. Accidentally sent out the message too
>> > early.
>> >
>> > On Thu, May 28, 2015 at 4:03 PM, Mikio Braun wrote:
>> >> Hi Till and Theodore,
>> >>
>> >> I think the code is cleaned up a lot now; introducing
>> >> mapWithBcVariable helped a lot.
>> >>
>> >> I also get that the goal was to make the cost function for learning
>> >> linear models configurable. My main concern was that the solver
>> >> itself was already bound too specifically to the kind of cost
>> >> function we wanted to optimize.
>> >>
>> >> Maybe it helps to discuss this in terms of the math of the
>> >> optimization methods instead of the code. The mathematics tends to
>> >> be much more concise than the completely unrolled code derived from
>> >> it.
>> >>
>> >> Just very briefly, and correct me if I'm wrong, this is about
>> >> optimizing some cost function C(w, xys), where w is the weight
>> >> vector and xys is a set of labeled vectors.
>> >> For the update, we only need to consider the gradient of C(w, xys).
>> >>
>> >> Typically
>> >>
>> >> C(w, xys) = regularization(w) + sum over xys of losses
>> >>
>> >> so the gradient is
>> >>
>> >> grad regularization(w) + sum grad losses(w, xy)
>> >>
>> >> On Thu, May 28, 2015 at 3:04 PM, Till Rohrmann wrote:
>> >>> What tweaks would that be? I mean, what is required to implement
>> >>> L-BFGS?
>> >>>
>> >>> I guess that we won't get rid of the case statements because we
>> >>> have to decide between two code paths: one with and the other
>> >>> without a convergence criterion. But I think by pulling each branch
>> >>> into its own function, it becomes clearer now.
>> >>>
>> >>> I agree that the update step is not really the responsibility of
>> >>> the LossFunction. The only reason why the current prototype does it
>> >>> this way is that the regularization is part of the LossFunction. By
>> >>> moving the regularization out of the LossFunction and making it
>> >>> part of the Solver, we can avoid cluttering the LossFunction
>> >>> interface. However, then we have to provide different
>> >>> implementations of the Solver, one for each regularization type
>> >>> (L1, L2, etc.). Or we provide the regularization as a parameter
>> >>> which is responsible for updating the weight vector.
>> >>>
>> >>> What do you prefer?
>> >>>
>> >>> Cheers,
>> >>> Till
>> >>>
>> >>>
>> >>> On Thu, May 28, 2015 at 12:01 PM, Theodore Vasiloudis wrote:
>> >>>>
>> >>>> Hello Mikio and Till,
>> >>>>
>> >>>> I really like the idea of syntactic sugar to reduce the
>> >>>> boilerplate as well.
>> >>>>
>> >>>> I'm looking at the PR now. For me the goal was to decouple as many
>> >>>> things as possible, but in the end a lot of the logic ended up in
>> >>>> SGD itself.
>> >>>> The previous design allowed us to implement L-BFGS using Breeze
>> >>>> in a clean manner, but I think that this new design from Till can
>> >>>> also work for that, with some tweaks. But we have to look really
>> >>>> closely and see if it makes things simpler. The SGD implementation
>> >>>> looks simpler, I think, but there is still the need for the case
>> >>>> statements, and I'm not sure yet how we can get rid of them.
>> >>>>
>> >>>> The one thing that I'm not sure we want is the coupling of the
>> >>>> update step to the loss function (and the regularization by
>> >>>> extension). It's something that is also in the current
>> >>>> implementation of the optimization framework, but I was hoping we
>> >>>> could get rid of that. As Mikio said, we want interfaces that are
>> >>>> clean, and updating the weights should not be the "responsibility"
>> >>>> of the loss or the regularization; rather, it should be handled by
>> >>>> a function that takes the required information to perform the
>> >>>> update.
>> >>>>
>> >>>> I'm a fan of how Breeze does optimization: they have a
>> >>>> FirstOrderMinimizer interface with takeStep and
>> >>>> chooseDescentDirection functions, and the actual optimization
>> >>>> steps are performed in an infiniteIterations function implemented
>> >>>> in FirstOrderMinimizer that all the first-order minimization
>> >>>> algorithms use. The thing that changes in the various algorithms
>> >>>> is how the gradients are calculated, how the weights are updated,
>> >>>> etc.
>> >>>>
>> >>>> I would definitely recommend taking a look at their code.
>> >>>>
>> >>>>
>> >>>> On 05/28/2015 10:22 AM, Till Rohrmann wrote:
>> >>>>
>> >>>> I opened a PR [1] with the above-mentioned changes. It would be
>> >>>> great if you could take a look and give me some feedback on
>> >>>> whether this is now easier to understand.
>> >>>>
>> >>>> Cheers,
>> >>>> Till
>> >>>>
>> >>>> [1] https://github.com/apache/flink/pull/740
>> >>>>
>> >>>> On Thu, May 28, 2015 at 1:16 AM, Till Rohrmann wrote:
>> >>>>>
>> >>>>> Hi Mikio,
>> >>>>>
>> >>>>> I agree that we should add some more syntactic sugar to make
>> >>>>> programming with Scala more succinct and more fun :-) Especially
>> >>>>> for the use case where you have a simple map operation with a
>> >>>>> broadcast variable. I like your approach of wrapping the,
>> >>>>> admittedly, cumbersome RichMapFunction creation in a helper
>> >>>>> function. We could even use the pimp-my-library pattern to make
>> >>>>> it "part" of the usual DataSet.
>> >>>>>
>> >>>>> Something like this works:
>> >>>>>
>> >>>>> val x: DataSet[X] = ...
>> >>>>> val bcVar: DataSet[B] = ...
>> >>>>>
>> >>>>> x.mapWithBroadcast(bcVar) {
>> >>>>>   (element: X, b: B) => element - b
>> >>>>> }
>> >>>>>
>> >>>>> Great idea, Mikio. I really like that :-) I guess that we all got
>> >>>>> a little bit "betriebsblind" (blinded by routine) and stopped
>> >>>>> questioning ;-)
>> >>>>>
>> >>>>> Concerning the strong coupling of SGD with the different parts of
>> >>>>> the loss function: the idea was to make it easily configurable. I
>> >>>>> agree, though, that one could at least encapsulate the prediction
>> >>>>> function as part of the loss function and also add the
>> >>>>> regularization function to it. This would simplify the code of
>> >>>>> SGD.
>> >>>>> A possible interface for a loss function could look like
>> >>>>>
>> >>>>> trait LossFunction {
>> >>>>>   def loss(labeledVector: LabeledVector,
>> >>>>>     weightVector: WeightVector): Double
>> >>>>>
>> >>>>>   def gradient(labeledVector: LabeledVector,
>> >>>>>     weightVector: WeightVector): WeightVector
>> >>>>>
>> >>>>>   def updateWeightVector(oldWeightVector: WeightVector,
>> >>>>>     newWeightVector: WeightVector,
>> >>>>>     regularizationValue: Double): WeightVector
>> >>>>> }
>> >>>>>
>> >>>>> I can try to make a prototype of this interface to see how it
>> >>>>> looks and feels.
>> >>>>>
>> >>>>> Cheers,
>> >>>>>
>> >>>>> Till
>> >>>>>
>> >>>>>
>> >>>>> On Tue, May 26, 2015 at 3:03 PM, Mikio Braun wrote:
>> >>>>>>
>> >>>>>> I've been playing around with a few ideas, but you could have a
>> >>>>>> helper RichMapFunction like
>> >>>>>>
>> >>>>>> class MapWithBroadcast[B, In, Out](name: String)(fct: (B) => (In) => Out)
>> >>>>>>     extends RichMapFunction[In, Out] {
>> >>>>>>
>> >>>>>>   var f: (In) => Out = _
>> >>>>>>
>> >>>>>>   override def open(parameters: Configuration): Unit = {
>> >>>>>>     f = fct(getRuntimeContext.getBroadcastVariable[B](name).get(0))
>> >>>>>>   }
>> >>>>>>
>> >>>>>>   override def map(in: In): Out = f(in)
>> >>>>>> }
>> >>>>>>
>> >>>>>> which takes a function which gets the broadcast value and then
>> >>>>>> returns a function doing the actual map, and you would then use
>> >>>>>> it like this:
>> >>>>>>
>> >>>>>> def mapWithBroadcast[B, I, O: ClassTag : TypeInformation](
>> >>>>>>     xs: DataSet[I], bc: DataSet[B])
>> >>>>>>     (fct: (B) => (I) => O): DataSet[O] = {
>> >>>>>>   xs.map(new MapWithBroadcast[B, I, O]("dummy")(fct))
>> >>>>>>     .withBroadcastSet(bc, "dummy")
>> >>>>>> }
>> >>>>>>
>> >>>>>> And then...
>> >>>>>>
>> >>>>>> val x = env.fromCollection(Seq(1, 2, 3, 4, 5, 6, 7))
>> >>>>>> val sum = x.reduce(_ + _)
>> >>>>>>
>> >>>>>> val foo = mapWithBroadcast(x, sum)(sum => (x: Int) => x - sum)
>> >>>>>>
>> >>>>>> Has something like this been considered at any point? Or are you
>> >>>>>> all at the point where you can type in the broadcasting stuff
>> >>>>>> while you sleep ;) ?
>> >>>>>>
>> >>>>>> -M
>> >>>>>>
>> >>>>>> On Tue, May 26, 2015 at 1:14 PM, Mikio Braun wrote:
>> >>>>>> > Hi Theodore and Till,
>> >>>>>> >
>> >>>>>> > Last Friday Till suggested I have a look at the gradient code.
>> >>>>>> > It took me a while, but I think I got the main structure.
>> >>>>>> >
>> >>>>>> > I think I understand why the code is written the way it is,
>> >>>>>> > but just speaking from a purely data-science perspective, I
>> >>>>>> > find that a lot of the code is super-heavy on boilerplate. The
>> >>>>>> > LossCalculation and similar RichMapFunctions in GradientDescent
>> >>>>>> > easily take up 20 lines of code just to unpack a few broadcast
>> >>>>>> > variables and call a single function. I understand why we need
>> >>>>>> > to use broadcast variables here, and that this is a common
>> >>>>>> > idiom in Flink, but I wonder whether one cannot reduce the
>> >>>>>> > boilerplate code.
>> >>>>>> >
>> >>>>>> > In my experience, it is pretty important when implementing ML
>> >>>>>> > algorithms that the mapping back from the code to the
>> >>>>>> > algorithm is still somewhat clear, because if you need to do
>> >>>>>> > modifications or debugging, it's important that you still know
>> >>>>>> > what corresponds to what, and with so many lines of code just
>> >>>>>> > moving data around, that's hardly the case.
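[The core of Mikio's helper is the curried function type (B) => (In) => Out: the outer application happens once when the broadcast value becomes available, mirroring open(), and the inner function then runs per element, mirroring map(). The pattern can be sketched without Flink, using a plain Seq in place of a DataSet; the names below are illustrative, not the Flink API.]

```scala
// Plain-Scala sketch of the curried broadcast-map pattern: the outer
// function binds the "broadcast" value exactly once (like open()), and
// the resulting per-element function is mapped over the data (like
// map()). Seq stands in for DataSet; this is not Flink API.
object BroadcastSketch {
  def mapWithBroadcast[B, I, O](xs: Seq[I], bc: B)(fct: B => I => O): Seq[O] = {
    val f = fct(bc) // "open": bind the broadcast value once
    xs.map(f)       // "map": apply the bound function per element
  }

  def main(args: Array[String]): Unit = {
    val x = Seq(1, 2, 3, 4, 5, 6, 7)
    val sum = x.sum // stands in for the broadcast DataSet holding the sum

    // subtract the (broadcast) sum from every element
    val foo = mapWithBroadcast(x, sum)(s => (v: Int) => v - s)
    println(foo) // List(-27, -26, -25, -24, -23, -22, -21)
  }
}
```

The design point is that all the broadcast plumbing lives in one helper; call sites only state the binding function.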
>> >>>>>> >
>> >>>>>> > A general design decision I would like to question is the
>> >>>>>> > tight integration of the specific loss functions,
>> >>>>>> > regularizers, etc. into the GradientDescent code. Given that
>> >>>>>> > it is really a pretty general approach which is applicable to
>> >>>>>> > many different models, I think it's better to have the
>> >>>>>> > stochastic gradient code be as agnostic as possible about
>> >>>>>> > exactly what it is optimizing. Instead, you would pass it a
>> >>>>>> > function which contains a method to compute a function value
>> >>>>>> > and its derivative, and possibly some action to regularize the
>> >>>>>> > parameters (if this cannot already be done with the function
>> >>>>>> > itself).
>> >>>>>> >
>> >>>>>> > Then, such a function for learning linear models can be
>> >>>>>> > configured in the same way as it is right now. I personally
>> >>>>>> > would opt for optional parameters instead of the
>> >>>>>> > builder-method style, but that is OK with me.
>> >>>>>> >
>> >>>>>> > This would in particular mean that I would try to eliminate
>> >>>>>> > the case statements as much as possible.
>> >>>>>> >
>> >>>>>> > The result would then be a bunch of classes around the
>> >>>>>> > gradient descent optimizer which very clearly just encode that
>> >>>>>> > algorithm, and a bunch of classes around cost functions for
>> >>>>>> > linear models which encode that, and both parts would talk
>> >>>>>> > with one another only through a small, clean interface. Plus,
>> >>>>>> > this interface could be reused for other optimization
>> >>>>>> > algorithms like BFGS, or the Trust Region Newton method we
>> >>>>>> > talked about last time.
>> >>>>>> >
>> >>>>>> > So my suggestions:
>> >>>>>> > - think of some way to reduce boilerplate code for broadcast
>> >>>>>> >   variables
>> >>>>>> > - split the gradient code and the specific model for linear
>> >>>>>> >   models to make the individual parts cleaner and more
>> >>>>>> >   extensible
>> >>>>>> >
>> >>>>>> > Hope this helps,
>> >>>>>> >
>> >>>>>> > -M
>> >>>>>> >
>> >>>>>> > --
>> >>>>>> > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>> >>>>>
>> >>>>> --
>> >>>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>> >>>>>
>> >>>>> info@data-artisans.com
>> >>>>> phone +493055599146
>> >>>>> mobile +4917671721261
>> >>>>>
>> >>>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>> >>>>> Managing Directors: Kostas Tzoumas, Stephan Ewen

--
Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
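[The decoupled design the thread converges on, an SGD loop that only sees a small cost-function interface and knows nothing about linear models or losses, can be sketched in plain Scala. Seq replaces DataSet, and all names are illustrative rather than the eventual Flink ML API; least squares without regularization is just one possible CostFunction instance.]

```scala
// Sketch of the decoupling discussed in this thread: the optimizer only
// sees a small CostFunction interface. Seq replaces DataSet; all names
// are illustrative, not the Flink ML API.
object SgdSketch {
  type Weights = Array[Double]

  trait CostFunction {
    def gradient(w: Weights, xys: Seq[(Array[Double], Double)]): Weights
    def parameterUpdate(w: Weights, grad: Weights, learningRate: Double): Weights
  }

  // Plain least-squares linear model without regularization, as one
  // possible CostFunction instance: grad = mean over (x, y) of
  // (w . x - y) * x.
  object LeastSquares extends CostFunction {
    def gradient(w: Weights, xys: Seq[(Array[Double], Double)]): Weights = {
      val g = new Array[Double](w.length)
      for ((x, y) <- xys) {
        val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
        for (i <- w.indices) g(i) += err * x(i)
      }
      g.map(_ / xys.size)
    }

    def parameterUpdate(w: Weights, grad: Weights, learningRate: Double): Weights =
      w.zip(grad).map { case (wi, gi) => wi - learningRate * gi }
  }

  // The solver is agnostic: it just alternates gradient and update, and
  // could be handed any CostFunction with a data-dependent gradient.
  def optimize(cost: CostFunction, xys: Seq[(Array[Double], Double)],
               w0: Weights, learningRate: Double, iterations: Int): Weights =
    (1 to iterations).foldLeft(w0) { (w, _) =>
      cost.parameterUpdate(w, cost.gradient(w, xys), learningRate)
    }
}
```

Swapping in L-BFGS or a trust-region method would then only mean replacing the loop in optimize, not touching the cost-function classes.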