Subject: Re: Some feedback on the Gradient Descent Code
From: Mikio Braun
To: Till Rohrmann
Cc: dev@flink.apache.org, Theodore Vasiloudis
Date: Thu, 28 May 2015 17:14:39 +0200

GradientDescent is just the (batch-)SGD optimizer, right?

Actually I think the parameter update should be done by a
RegularizationFunction. IMHO the structure should be like this:

GradientDescent
- collects gradient and regularization updates from
- CostFunction

LinearModelCostFunction
- inherits from CostFunction
- parameterized by LossFunction and RegularizationFunction

Does that make sense?

On Thu, May 28, 2015 at 5:00 PM, Till Rohrmann wrote:
> Hey Mikio,
>
> yes, you're right. The SGD only needs to know the gradient of the loss
> function and some means to update the weights in accordance with the
> regularization scheme. Additionally, we also need to be able to compute
> the loss for the convergence criterion.
>
> That's also how it is implemented in the before-mentioned PR right now.
> The question which arises now is whether the LossFunction should be
> responsible for updating the weight vector or not. So what we could do
> is the following: remove the regularization from the LossFunction and
> let the GradientDescent solver do the parameter update. This either
> requires specialized implementations such as GradientDescentL1 and
> GradientDescentL2, or a configuration parameter which defines a
> regularization function which is then called for the parameter update:
>
> trait RegularizationFunction {
>   def parameterUpdate(weightVector: WeightVector, gradient: WeightVector,
>     learningRate: Double, regularizationConstant: Double): WeightVector
>
>   def regularizationValue(weightVector: WeightVector): Double
> }
>
> Both approaches are semantically equivalent. I have no strong preference
> for either of them. What do you think is the better approach?
>
>
> On Thu, May 28, 2015 at 4:13 PM, Mikio Braun wrote:
>>
>> [Ok, so maybe this is exactly what is implemented, sorry if I'm just
>> repeating you...]
>>
>> So
>>
>> C(w, xys) = regularization(w) + sum over xys of losses
>>
>> The gradient is
>>
>> grad regularization(w) + sum grad losses(w, xy)
>>
>> For some regularization functions, regularization is better performed
>> by some explicit operation on the weights, like subtracting some value
>> if it is still too large (I forgot how exactly that works).
>>
>> The losses consist of the model (for now, linear I guess), plus the
>> kind of loss (hinge, squared, and maybe logistic for LR).
>>
>> So from the viewpoint of SGD, we only need a way to compute the
>> gradient for a set of examples, and maybe some way to update (e.g.
>> shrink) the weights for regularization.
>>
>> So SGD takes some
>>
>> trait CostFunction {
>>   def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector
>>
>>   def parameterUpdate(weightVector: Vector): Vector
>> }
>>
>> so that SGD can run.
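[To make the RegularizationFunction interface proposed above concrete, here is a minimal plain-Scala sketch. It is not the actual PR code: WeightVector is simplified to Array[Double], and the L2 update rule w' = w - eta * (g + lambda * w) is one hypothetical implementation of parameterUpdate.]

```scala
// Minimal sketch of the RegularizationFunction idea discussed above.
// WeightVector is simplified to Array[Double]; names are illustrative,
// not the actual Flink ML API.
trait RegularizationFunction {
  def parameterUpdate(weights: Array[Double], gradient: Array[Double],
                      learningRate: Double,
                      regularizationConstant: Double): Array[Double]

  def regularizationValue(weights: Array[Double]): Double
}

// L2 regularization: adds lambda * w to the gradient before the step.
// regularizationValue returns 0.5 * ||w||^2; the regularization
// constant is applied by the caller when computing the total loss.
object L2Regularization extends RegularizationFunction {
  def parameterUpdate(weights: Array[Double], gradient: Array[Double],
                      learningRate: Double,
                      regularizationConstant: Double): Array[Double] =
    weights.zip(gradient).map { case (w, g) =>
      w - learningRate * (g + regularizationConstant * w)
    }

  def regularizationValue(weights: Array[Double]): Double =
    0.5 * weights.map(w => w * w).sum
}
```

With this shape, GradientDescentL1 and GradientDescentL2 collapse into a single solver that is handed the regularization function as a parameter.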
>>
>> On the other hand, we could have a
>>
>> class LinearModelCostFunction {
>>   val lossFunction: LossFunction
>>   val regularizationFunction: RegularizationFunction
>>
>>   val regularizationConstant
>>
>>   def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector = {
>>     regularizationConstant * regularizationFunction.gradient(weightVector) +
>>       xys.map(xy => lossFunction.gradient(xy, weightVector)).sum()
>>   }
>>
>>   def parameterUpdate = regularizationFunction.parameterUpdate
>> }
>>
>> That way the optimizer and the cost function would be entirely
>> decoupled, and you could use the SGD to train practically anything
>> which has a data-dependent gradient.
>>
>> What do you think?
>>
>> -M
>>
>> On Thu, May 28, 2015 at 4:03 PM, Mikio Braun wrote:
>> > Oh wait... continuing to type. Accidentally sent out the message too
>> > early.
>> >
>> > On Thu, May 28, 2015 at 4:03 PM, Mikio Braun wrote:
>> >> Hi Till and Theodore,
>> >>
>> >> I think the code is cleaned up a lot now; introducing
>> >> mapWithBcVariable helped a lot.
>> >>
>> >> I also get that the goal was to make the cost function for learning
>> >> linear models configurable. My main concern was that the solver
>> >> itself was already bound too specifically to the kind of cost
>> >> function we wanted to optimize.
>> >>
>> >> Maybe it helps to discuss this in terms of the math of the
>> >> optimization methods instead of the code. The mathematics tends to
>> >> be much more concise than the completely unrolled code derived from
>> >> it.
>> >>
>> >> Just very briefly, and correct me if I'm wrong, this is about
>> >> optimizing some cost function C(w, xys), where w is the weight
>> >> vector and xys is a set of labeled vectors.
>> >> For the update, we only need to consider the gradient of C(w, xys).
>> >>
>> >> Typically
>> >>
>> >> C(w, xys) = regularization(w) + sum over xys of losses
>> >>
>> >> so the gradient is
>> >>
>> >> grad regularization(w) + sum grad losses(w, xy)
>> >>
>> >> On Thu, May 28, 2015 at 3:04 PM, Till Rohrmann wrote:
>> >>> What tweaks would that be? I mean, what is required to implement
>> >>> L-BFGS?
>> >>>
>> >>> I guess that we won't get rid of the case statements because we
>> >>> have to decide between two code paths: one with and the other
>> >>> without a convergence criterion. But I think by pulling each branch
>> >>> into its own function, it becomes clearer now.
>> >>>
>> >>> I agree that the update step is not really the responsibility of
>> >>> the LossFunction. The only reason why the current prototype does it
>> >>> this way is that the regularization is part of the LossFunction. By
>> >>> moving the regularization out of the LossFunction and making it
>> >>> part of the Solver, we can avoid cluttering the LossFunction
>> >>> interface. However, then we have to provide different
>> >>> implementations of the Solver, one for each regularization type
>> >>> (L1, L2, etc.). Or we provide the regularization as a parameter
>> >>> which is responsible for updating the weight vector.
>> >>>
>> >>> What do you prefer?
>> >>>
>> >>> Cheers,
>> >>> Till
>> >>>
>> >>>
>> >>> On Thu, May 28, 2015 at 12:01 PM, Theodore Vasiloudis wrote:
>> >>>>
>> >>>> Hello Mikio and Till,
>> >>>>
>> >>>> I really like the idea of syntactic sugar to reduce the
>> >>>> boilerplate as well.
>> >>>>
>> >>>> I'm looking at the PR now. For me the goal was to decouple as many
>> >>>> things as possible, but in the end a lot of the logic ended up in
>> >>>> SGD itself.
>> >>>> The previous design allowed us to implement L-BFGS using Breeze
>> >>>> in a clean manner, but I think that this new design from Till can
>> >>>> also work for that, with some tweaks. But we have to look really
>> >>>> closely and see if it makes things simpler. The SGD implementation
>> >>>> looks simpler, I think, but there is still the need for the case
>> >>>> statements, and I'm not sure yet how we can get rid of them.
>> >>>>
>> >>>> The one thing that I'm not sure we want is the coupling of the
>> >>>> update step to the loss function (and the regularization by
>> >>>> extension). It's something that is also in the current
>> >>>> implementation of the optimization framework, but I was hoping we
>> >>>> could get rid of that. As Mikio said, we want interfaces that are
>> >>>> clean, and updating the weights should not be the "responsibility"
>> >>>> of the loss or the regularization; rather, it should be handled by
>> >>>> a function that takes the required information to perform the
>> >>>> update.
>> >>>>
>> >>>> I'm a fan of how Breeze does optimization: they have a
>> >>>> FirstOrderMinimizer interface with takeStep and
>> >>>> chooseDescentDirection functions, and the actual optimization
>> >>>> steps are performed in an infiniteIterations function implemented
>> >>>> in FirstOrderMinimizer that all the first-order minimization
>> >>>> algorithms use. The thing that changes in the various algorithms
>> >>>> is how the gradients are calculated, how the weights are updated,
>> >>>> etc.
>> >>>>
>> >>>> I would definitely recommend taking a look at their code.
>> >>>>
>> >>>>
>> >>>> On 05/28/2015 10:22 AM, Till Rohrmann wrote:
>> >>>>
>> >>>> I opened a PR [1] with the above-mentioned changes. It would be
>> >>>> great if you could take a look and give me some feedback on
>> >>>> whether this is now easier to understand.
>> >>>>
>> >>>> Cheers,
>> >>>> Till
>> >>>>
>> >>>> [1] https://github.com/apache/flink/pull/740
>> >>>>
>> >>>> On Thu, May 28, 2015 at 1:16 AM, Till Rohrmann wrote:
>> >>>>>
>> >>>>> Hi Mikio,
>> >>>>>
>> >>>>> I agree that we should add some more syntactic sugar to make
>> >>>>> programming with Scala more succinct and more fun :-) Especially
>> >>>>> for the use case where you have a simple map operation with a
>> >>>>> broadcast variable. I like your approach of wrapping the,
>> >>>>> admittedly, cumbersome RichMapFunction creation in a helper
>> >>>>> function. We could even use the pimp-my-library pattern to make
>> >>>>> it "part" of the usual DataSet.
>> >>>>>
>> >>>>> Something like this works:
>> >>>>>
>> >>>>> val x: DataSet[X] = ...
>> >>>>> val bcVar: DataSet[B] = ...
>> >>>>>
>> >>>>> x.mapWithBroadcast(bcVar) {
>> >>>>>   (element: X, b: B) => element - b
>> >>>>> }
>> >>>>>
>> >>>>> Great idea, Mikio. I really like that :-) I guess that we all got
>> >>>>> a little bit "betriebsblind" (blinded by routine) and stopped
>> >>>>> questioning ;-)
>> >>>>>
>> >>>>> Concerning the strong coupling of SGD with the different parts of
>> >>>>> the loss function: the idea was to make it easily configurable. I
>> >>>>> agree, though, that one could at least encapsulate the prediction
>> >>>>> function as part of the loss function and also add the
>> >>>>> regularization function to it. This would simplify the code of
>> >>>>> SGD.
>> >>>>> A possible interface for a loss function could look like
>> >>>>>
>> >>>>> trait LossFunction {
>> >>>>>   def loss(labeledVector: LabeledVector,
>> >>>>>     weightVector: WeightVector): Double
>> >>>>>
>> >>>>>   def gradient(labeledVector: LabeledVector,
>> >>>>>     weightVector: WeightVector): WeightVector
>> >>>>>
>> >>>>>   def updateWeightVector(oldWeightVector: WeightVector,
>> >>>>>     newWeightVector: WeightVector,
>> >>>>>     regularizationValue: Double): WeightVector
>> >>>>> }
>> >>>>>
>> >>>>> I can try to make a prototype of this interface to see how it
>> >>>>> looks and feels.
>> >>>>>
>> >>>>> Cheers,
>> >>>>>
>> >>>>> Till
>> >>>>>
>> >>>>>
>> >>>>> On Tue, May 26, 2015 at 3:03 PM, Mikio Braun wrote:
>> >>>>>>
>> >>>>>> I've been playing around with a few ideas, but you could have a
>> >>>>>> helper RichMapFunction like
>> >>>>>>
>> >>>>>> class MapWithBroadcast[B, In, Out](name: String)(fct: (B) => (In) => Out)
>> >>>>>>     extends RichMapFunction[In, Out] {
>> >>>>>>
>> >>>>>>   var f: (In) => Out = _
>> >>>>>>
>> >>>>>>   override def open(parameters: Configuration): Unit = {
>> >>>>>>     f = fct(getRuntimeContext.getBroadcastVariable[B](name).get(0))
>> >>>>>>   }
>> >>>>>>
>> >>>>>>   override def map(in: In): Out = f(in)
>> >>>>>> }
>> >>>>>>
>> >>>>>> which takes a function which gets the broadcast value and then
>> >>>>>> returns a function doing the actual map, and you would then use
>> >>>>>> it like this:
>> >>>>>>
>> >>>>>> def mapWithBroadcast[B, I, O: ClassTag : TypeInformation](
>> >>>>>>     xs: DataSet[I], bc: DataSet[B])
>> >>>>>>     (fct: (B) => (I) => O): DataSet[O] = {
>> >>>>>>   xs.map(new MapWithBroadcast[B, I, O]("dummy")(fct))
>> >>>>>>     .withBroadcastSet(bc, "dummy")
>> >>>>>> }
>> >>>>>>
>> >>>>>> And then...
>> >>>>>>
>> >>>>>> val x = env.fromCollection(Seq(1, 2, 3, 4, 5, 6, 7))
>> >>>>>> val sum = x.reduce(_ + _)
>> >>>>>>
>> >>>>>> val foo = mapWithBroadcast(x, sum)(sum => (x: Int) => x - sum)
>> >>>>>>
>> >>>>>> Has something like this been considered at any point? Or are you
>> >>>>>> all at the point where you can type in the broadcasting stuff
>> >>>>>> while you sleep ;) ?
>> >>>>>>
>> >>>>>> -M
>> >>>>>>
>> >>>>>> On Tue, May 26, 2015 at 1:14 PM, Mikio Braun wrote:
>> >>>>>> > Hi Theodore and Till,
>> >>>>>> >
>> >>>>>> > Last Friday Till suggested I have a look at the gradient code.
>> >>>>>> > It took me a while, but I think I got the main structure.
>> >>>>>> >
>> >>>>>> > I think I understand why the code is written the way it is,
>> >>>>>> > but just speaking from a purely data-science perspective, I
>> >>>>>> > find that a lot of the code is super-heavy on boilerplate. The
>> >>>>>> > LossCalculation and similar RichMapFunctions in GradientDescent
>> >>>>>> > easily take up 20 lines of code just to unpack a few broadcast
>> >>>>>> > variables and call a single function. I understand why we need
>> >>>>>> > to use broadcast variables here, and that this is a common
>> >>>>>> > idiom in Flink, but I wonder whether one cannot reduce the
>> >>>>>> > boilerplate code.
>> >>>>>> >
>> >>>>>> > In my experience, it is pretty important when implementing ML
>> >>>>>> > algorithms that the mapping back from the code to the
>> >>>>>> > algorithm is still somewhat clear, because if you need to do
>> >>>>>> > modifications or debugging, it's important that you still know
>> >>>>>> > what corresponds to what, and with so many lines of code just
>> >>>>>> > moving data around, that's hardly the case.
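[The core of Mikio's helper is the curried function type (B) => (In) => Out: the outer application happens once when the broadcast value becomes available, mirroring open(), and the inner function then runs per element, mirroring map(). The pattern can be sketched without Flink, using a plain Seq in place of a DataSet; the names below are illustrative, not the Flink API.]

```scala
// Plain-Scala sketch of the curried broadcast-map pattern: the outer
// function binds the "broadcast" value exactly once (like open()), and
// the resulting per-element function is mapped over the data (like
// map()). Seq stands in for DataSet; this is not Flink API.
object BroadcastSketch {
  def mapWithBroadcast[B, I, O](xs: Seq[I], bc: B)(fct: B => I => O): Seq[O] = {
    val f = fct(bc) // "open": bind the broadcast value once
    xs.map(f)       // "map": apply the bound function per element
  }

  def main(args: Array[String]): Unit = {
    val x = Seq(1, 2, 3, 4, 5, 6, 7)
    val sum = x.sum // stands in for the broadcast DataSet holding the sum

    // subtract the (broadcast) sum from every element
    val foo = mapWithBroadcast(x, sum)(s => (v: Int) => v - s)
    println(foo) // List(-27, -26, -25, -24, -23, -22, -21)
  }
}
```

The design point is that all the broadcast plumbing lives in one helper; call sites only state the binding function.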
>> >>>>>> >
>> >>>>>> > A general design decision I would like to question is the
>> >>>>>> > tight integration of the specific loss functions,
>> >>>>>> > regularizers, etc. into the GradientDescent code. Given that
>> >>>>>> > it is really a pretty general approach which is applicable to
>> >>>>>> > many different models, I think it's better to have the
>> >>>>>> > stochastic gradient code be as agnostic as possible about
>> >>>>>> > exactly what it is optimizing. Instead, you would pass it a
>> >>>>>> > function which contains a method to compute a function value
>> >>>>>> > and its derivative, and possibly some action to regularize the
>> >>>>>> > parameters (if this cannot already be done with the function
>> >>>>>> > itself).
>> >>>>>> >
>> >>>>>> > Then, such a function for learning linear models can be
>> >>>>>> > configured in the same way as it is right now. I personally
>> >>>>>> > would opt for optional parameters instead of the
>> >>>>>> > builder-method style, but that is OK with me.
>> >>>>>> >
>> >>>>>> > This would in particular mean that I would try to eliminate
>> >>>>>> > the case statements as much as possible.
>> >>>>>> >
>> >>>>>> > The result would then be a bunch of classes around the
>> >>>>>> > gradient descent optimizer which very clearly just encode that
>> >>>>>> > algorithm, and a bunch of classes around cost functions for
>> >>>>>> > linear models which encode that, and both parts would talk
>> >>>>>> > with one another only through a small, clean interface. Plus,
>> >>>>>> > this interface could be reused for other optimization
>> >>>>>> > algorithms like BFGS, or the Trust Region Newton method we
>> >>>>>> > talked about last time.
>> >>>>>> >
>> >>>>>> > So my suggestions:
>> >>>>>> > - think of some way to reduce boilerplate code for broadcast
>> >>>>>> >   variables
>> >>>>>> > - split the gradient code and the specific model for linear
>> >>>>>> >   models to make the individual parts cleaner and more
>> >>>>>> >   extensible
>> >>>>>> >
>> >>>>>> > Hope this helps,
>> >>>>>> >
>> >>>>>> > -M
>> >>>>>> >
>> >>>>>> > --
>> >>>>>> > Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>> >>>>>
>> >>>>> --
>> >>>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
>> >>>>>
>> >>>>> info@data-artisans.com
>> >>>>> phone +493055599146
>> >>>>> mobile +4917671721261
>> >>>>>
>> >>>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
>> >>>>> Managing Directors: Kostas Tzoumas, Stephan Ewen

--
Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
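[The decoupled design the thread converges on, an SGD loop that only sees a small cost-function interface and knows nothing about linear models or losses, can be sketched in plain Scala. Seq replaces DataSet, and all names are illustrative rather than the eventual Flink ML API; least squares without regularization is just one possible CostFunction instance.]

```scala
// Sketch of the decoupling discussed in this thread: the optimizer only
// sees a small CostFunction interface. Seq replaces DataSet; all names
// are illustrative, not the Flink ML API.
object SgdSketch {
  type Weights = Array[Double]

  trait CostFunction {
    def gradient(w: Weights, xys: Seq[(Array[Double], Double)]): Weights
    def parameterUpdate(w: Weights, grad: Weights, learningRate: Double): Weights
  }

  // Plain least-squares linear model without regularization, as one
  // possible CostFunction instance: grad = mean over (x, y) of
  // (w . x - y) * x.
  object LeastSquares extends CostFunction {
    def gradient(w: Weights, xys: Seq[(Array[Double], Double)]): Weights = {
      val g = new Array[Double](w.length)
      for ((x, y) <- xys) {
        val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
        for (i <- w.indices) g(i) += err * x(i)
      }
      g.map(_ / xys.size)
    }

    def parameterUpdate(w: Weights, grad: Weights, learningRate: Double): Weights =
      w.zip(grad).map { case (wi, gi) => wi - learningRate * gi }
  }

  // The solver is agnostic: it just alternates gradient and update, and
  // could be handed any CostFunction with a data-dependent gradient.
  def optimize(cost: CostFunction, xys: Seq[(Array[Double], Double)],
               w0: Weights, learningRate: Double, iterations: Int): Weights =
    (1 to iterations).foldLeft(w0) { (w, _) =>
      cost.parameterUpdate(w, cost.gradient(w, xys), learningRate)
    }
}
```

Swapping in L-BFGS or a trust-region method would then only mean replacing the loop in optimize, not touching the cost-function classes.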