flink-dev mailing list archives

From Till Rohrmann <till.rohrm...@gmail.com>
Subject Re: Some feedback on the Gradient Descent Code
Date Thu, 28 May 2015 15:23:57 GMT
Yes, GradientDescent == (batch-)SGD.

That was also my first idea of how to implement it. However, what happens
if the regularization is specific to the algorithm actually used? For
example, for L-BFGS with L1 regularization you have a different
`parameterUpdate` step (the Orthant-Wise Limited-memory Quasi-Newton
method) than for SGD with L1 (a proximal operator). Thus, we would have to
make sure that the right `RegularizationFunction` is used with the
`CostFunction` depending on the optimization algorithm you choose. Maybe
there are also some optimization algorithms which cannot work with certain
regularization types.

Do you have some more insights into this?
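For illustration, here is a minimal sketch of the proximal-operator variant mentioned above for SGD with L1: an ordinary gradient step followed by soft-thresholding. This is hypothetical glue code, not from the actual PR; `Array[Double]` stands in for Flink's `WeightVector`, and all names are illustrative.

```scala
// Sketch of SGD-with-L1 via a proximal operator (soft-thresholding).
// Array[Double] stands in for Flink's WeightVector; names are illustrative.
object L1ProximalUpdate {

  // Soft-thresholding: shrink a weight towards zero by `shrinkage`,
  // clipping at zero instead of crossing it.
  def softThreshold(w: Double, shrinkage: Double): Double =
    math.signum(w) * math.max(0.0, math.abs(w) - shrinkage)

  def parameterUpdate(
      weights: Array[Double],
      gradient: Array[Double],
      learningRate: Double,
      regularizationConstant: Double): Array[Double] = {
    // 1. ordinary gradient step on the (unregularized) loss
    val stepped = weights.zip(gradient).map { case (w, g) => w - learningRate * g }
    // 2. proximal step for the L1 term
    stepped.map(softThreshold(_, learningRate * regularizationConstant))
  }
}
```

An OWL-QN-style update for L-BFGS would replace step 2 with orthant projection, which is why the right `RegularizationFunction` has to match the solver.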

On Thu, May 28, 2015 at 5:14 PM, Mikio Braun <mikiobraun@googlemail.com>
wrote:

> GradientDescent is just the (batch-)SGD optimizer, right? Actually
> I think the parameter update should be done by a
> RegularizationFunction.
>
> IMHO the structure should be like this:
>
> GradientDescent
>   - collects gradient and regularization updates from CostFunction
>
> LinearModelCostFunction
>   - inherits from CostFunction
>   - parameterized by LossFunction and RegularizationFunction
>
> Does that make sense?
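A minimal sketch of the structure proposed above, with types simplified (`Array[Double]` and `Seq` in place of Flink's vectors and `DataSet`); every name beyond those already in the thread is an assumption:

```scala
// The optimizer would only see CostFunction; LinearModelCostFunction
// composes a LossFunction with a RegularizationFunction.
trait CostFunction {
  def gradient(weights: Array[Double], xys: Seq[(Array[Double], Double)]): Array[Double]
  def parameterUpdate(weights: Array[Double]): Array[Double]
}

trait LossFunction {
  // gradient of the loss w.r.t. the weights for one (features, label) example
  def gradient(x: Array[Double], y: Double, weights: Array[Double]): Array[Double]
}

trait RegularizationFunction {
  def gradient(weights: Array[Double]): Array[Double]
  def parameterUpdate(weights: Array[Double]): Array[Double]
}

class LinearModelCostFunction(
    loss: LossFunction,
    reg: RegularizationFunction,
    regularizationConstant: Double) extends CostFunction {

  def gradient(weights: Array[Double], xys: Seq[(Array[Double], Double)]): Array[Double] = {
    val regGrad = reg.gradient(weights).map(_ * regularizationConstant)
    val lossGrad = xys.map { case (x, y) => loss.gradient(x, y, weights) }
      .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
    regGrad.zip(lossGrad).map { case (u, v) => u + v }
  }

  def parameterUpdate(weights: Array[Double]): Array[Double] =
    reg.parameterUpdate(weights)
}
```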
>
>
>
> On Thu, May 28, 2015 at 5:00 PM, Till Rohrmann <till.rohrmann@gmail.com>
> wrote:
> > Hey Mikio,
> >
> > yes, you’re right. The SGD only needs to know the gradient of the loss
> > function and some means to update the weights in accordance with the
> > regularization scheme. Additionally, we also need to be able to compute
> > the loss for the convergence criterion.
> >
> > That’s also how it is implemented in the before-mentioned PR right now.
> > The question which arises now is whether the LossFunction should be
> > responsible for updating the weight vector or not. So what we could do
> > is the following: remove the regularization from the LossFunction and
> > let the GradientDescent solver do the parameterUpdate. This requires
> > either specialized implementations such as GradientDescentL1 and
> > GradientDescentL2, or a configuration parameter which defines a
> > regularization function which is then called for the parameterUpdate.
> >
> > trait RegularizationFunction {
> >   def parameterUpdate(weightVector: WeightVector, gradient: WeightVector,
> >     learningRate: Double, regularizationConstant: Double): WeightVector
> >
> >   def regularizationValue(weightVector: WeightVector): Double
> > }
> >
> > Both approaches are semantically equivalent. I have no strong preference
> > for either of them. What do you think is the better approach?
> >
> >
> > On Thu, May 28, 2015 at 4:13 PM, Mikio Braun <mikiobraun@googlemail.com>
> > wrote:
> >>
> >> [Ok, so maybe this is exactly what is implemented, sorry if I'm just
> >> repeating you... ]
> >>
> >> So
> >>
> >>    C(w, xys) = C * regularization(w) + sum over xys of loss(w, xy)
> >>
> >> and the gradient is
> >>
> >>    C * grad regularization(w) + sum over xys of grad loss(w, xy)
> >>
> >> For some regularization functions, regularization is better performed
> >> by an explicit operation on the weights, like subtracting some value
> >> if a weight is still too large (I forgot exactly how that works).
> >>
> >> The losses consist of the model (for now, linear I guess) plus the
> >> kind of loss (hinge, squared, and maybe logistic for LR).
> >>
> >> So from the viewpoint of SGD, we only need a way to compute the
> >> gradient for a set of examples, and maybe some way to update (e.g.
> >> shrink) the weights for regularization.
> >>
> >> So SGD takes some
> >>
> >> trait CostFunction {
> >>   def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector
> >>
> >>   def parameterUpdate(weightVector: Vector): Vector
> >> }
> >>
> >> so that SGD can run.
> >>
> >> On the other hand, we could have a
> >>
> >> class LinearModelCostFunction {
> >>   val lossFunction: LossFunction
> >>   val regularizationFunction: RegularizationFunction
> >>
> >>   val regularizationConstant: Double
> >>
> >>   def gradient(weightVector: Vector, xys: DataSet[LabeledVector]): Vector = {
> >>     regularizationConstant * regularizationFunction.gradient(weightVector) +
> >>       xys.map(xy => lossFunction.gradient(xy, weightVector)).sum()
> >>   }
> >>
> >>   def parameterUpdate = regularizationFunction.parameterUpdate
> >> }
> >>
> >> That way the optimizer and the cost function would be entirely
> >> uncoupled, and you could use the SGD to train on practically anything
> >> which has a data-dependent gradient.
> >>
> >> What do you think?
> >>
> >> -M
> >>
> >> On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <mikiobraun@googlemail.com>
> >> wrote:
> >> > Oh wait... continuing to type. Accidentally sent out the message too
> >> > early.
> >> >
> >> > On Thu, May 28, 2015 at 4:03 PM, Mikio Braun <mikiobraun@googlemail.com>
> >> > wrote:
> >> >> Hi Till and Theodore,
> >> >>
> >> >> I think the code is cleaned up a lot now, introducing the
> >> >> mapWithBcVariable helped a lot.
> >> >>
> >> >> I also get that the goal was to make the cost function for learning
> >> >> linear models nicely configurable. My main concern was that the
> >> >> solver itself was already bound too specifically to the kind of
> >> >> cost function we wanted to optimize.
> >> >>
> >> >> Maybe it helps to discuss this in terms of the math of the
> >> >> optimization methods instead of the code. The mathematics tends to
> >> >> be much more concise and compact than the completely unrolled code
> >> >> derived from it.
> >> >>
> >> >> Just very briefly, and correct me if I'm wrong, this is about
> >> >> optimizing some cost function C(w, xys), where w is the weight
> >> >> vector and xys is a set of labeled vectors. For the update, we only
> >> >> need to consider the gradient of C(w, xys).
> >> >>
> >> >> Typically
> >> >>
> >> >>    C(w, xys) = C * regularization(w) + sum over xys of loss(w, xy)
> >> >>
> >> >> so the gradient is
> >> >>
> >> >>    C * grad regularization(w) + sum over xys of grad loss(w, xy)
> >> >>
> >> >> On Thu, May 28, 2015 at 3:04 PM, Till Rohrmann <till@data-artisans.com>
> >> >> wrote:
> >> >>> What tweaks would those be? I mean, what is required to implement
> >> >>> L-BFGS?
> >> >>>
> >> >>> I guess that we won’t get rid of the case statements because we
> >> >>> have to decide between two code paths: one with and one without a
> >> >>> convergence criterion. But I think that by pulling each branch
> >> >>> into its own function, it becomes clearer now.
> >> >>>
> >> >>> I agree that the update step is not really the responsibility of
> >> >>> the LossFunction. The only reason the current prototype does it
> >> >>> this way is that the regularization is part of the LossFunction.
> >> >>> By moving the regularization out of the LossFunction and making
> >> >>> it part of the Solver, we can avoid cluttering the LossFunction
> >> >>> interface. However, then we have to provide different
> >> >>> implementations of the Solver, one for each regularization type
> >> >>> (L1, L2, etc.). Or we provide the regularization as a parameter
> >> >>> which is responsible for updating the weight vector.
> >> >>>
> >> >>> What do you prefer?
> >> >>>
> >> >>> Cheers,
> >> >>> Till
> >> >>>
> >> >>>
> >> >>> On Thu, May 28, 2015 at 12:01 PM, Theodore Vasiloudis <tvas@kth.se>
> >> >>> wrote:
> >> >>>>
> >> >>>> Hello Mikio and Till,
> >> >>>>
> >> >>>> I really like the idea of syntactic sugar to reduce the boilerplate
> >> >>>> as
> >> >>>> well.
> >> >>>>
> >> >>>> I'm looking at the PR now. For me the goal was to decouple as
> >> >>>> many things as possible, but in the end a lot of the logic ended
> >> >>>> up in SGD itself. The previous design allowed us to implement
> >> >>>> L-BFGS using Breeze in a clean manner, but I think that this new
> >> >>>> design from Till can also work for that, with some tweaks. But
> >> >>>> we have to look really closely and see whether it makes things
> >> >>>> simpler. The SGD implementation looks simpler, I think, but
> >> >>>> there is still the need for the case statements, and I'm not
> >> >>>> sure yet how we can get rid of them.
> >> >>>>
> >> >>>> The one thing that I'm not sure we want is the coupling of the
> >> >>>> update step to the loss function (and, by extension, the
> >> >>>> regularization). It's something that is also in the current
> >> >>>> implementation of the optimization framework, but I was hoping
> >> >>>> we could get rid of that. As Mikio said, we want interfaces that
> >> >>>> are clean, and updating the weights should not be the
> >> >>>> "responsibility" of the loss or the regularization; rather, it
> >> >>>> should be handled by a function that takes the required
> >> >>>> information to perform the update.
> >> >>>>
> >> >>>> I'm a fan of how Breeze does optimization: they have a
> >> >>>> FirstOrderMinimizer interface with takeStep and
> >> >>>> chooseDescentDirection functions, and the actual optimization
> >> >>>> steps are performed in an infiniteIterations function
> >> >>>> implemented in FirstOrderMinimizer that all the first-order
> >> >>>> minimization algorithms use. The thing that changes between the
> >> >>>> various algorithms is how the gradients are calculated, how the
> >> >>>> weights are updated, etc.
> >> >>>>
> >> >>>> I would definitely recommend taking a look at their code.
> >> >>>>
> >> >>>>
> >> >>>> On 05/28/2015 10:22 AM, Till Rohrmann wrote:
> >> >>>>
> >> >>>> I opened a PR [1] with the above-mentioned changes. It would be
> >> >>>> great if you could take a look and give me some feedback on
> >> >>>> whether this is now easier to understand.
> >> >>>>
> >> >>>> Cheers,
> >> >>>> Till
> >> >>>>
> >> >>>> [1] https://github.com/apache/flink/pull/740
> >> >>>>
> >> >>>> On Thu, May 28, 2015 at 1:16 AM, Till Rohrmann
> >> >>>> <till@data-artisans.com>
> >> >>>> wrote:
> >> >>>>>
> >> >>>>> Hi Mikio,
> >> >>>>>
> >> >>>>> I agree that we should add some more syntactic sugar to make
> >> >>>>> programming with Scala more succinct and more fun :-)
> >> >>>>> Especially for the use case where you have a simple map
> >> >>>>> operation with a broadcast variable. I like your approach of
> >> >>>>> wrapping the, admittedly, cumbersome RichMapFunction creation
> >> >>>>> in a helper function. We could even use the pimp-my-library
> >> >>>>> pattern to make it “part” of the usual DataSet.
> >> >>>>>
> >> >>>>> Something like this works:
> >> >>>>>
> >> >>>>> val x: DataSet[X] = ...
> >> >>>>> val bcVar: DataSet[B] = ...
> >> >>>>>
> >> >>>>> x.mapWithBroadcast(bcVar){
> >> >>>>>   (element: X, b: B) => element - b
> >> >>>>> }
> >> >>>>>
> >> >>>>> Great idea Mikio. I really like that :-) I guess that we all
> >> >>>>> got a little bit “betriebsblind” (blinded by routine) and
> >> >>>>> stopped questioning ;-)
> >> >>>>>
> >> >>>>> Concerning the strong coupling of SGD with the different parts
> >> >>>>> of the loss function, the idea was to make it easily
> >> >>>>> configurable. I agree, though, that one could at least
> >> >>>>> encapsulate the prediction function as part of the loss
> >> >>>>> function and also add the regularization function to it. This
> >> >>>>> would simplify the code of SGD. A possible interface for a loss
> >> >>>>> function could look like
> >> >>>>>
> >> >>>>> trait LossFunction {
> >> >>>>>   def loss(labeledVector: LabeledVector,
> >> >>>>>     weightVector: WeightVector): Double
> >> >>>>>
> >> >>>>>   def gradient(labeledVector: LabeledVector,
> >> >>>>>     weightVector: WeightVector): WeightVector
> >> >>>>>
> >> >>>>>   def updateWeightVector(oldWeightVector: WeightVector,
> >> >>>>>     newWeightVector: WeightVector,
> >> >>>>>     regularizationValue: Double): WeightVector
> >> >>>>> }
> >> >>>>>
> >> >>>>> I can try to make a prototype of this interface to see how it
> >> >>>>> looks and feels.
> >> >>>>>
> >> >>>>> Cheers,
> >> >>>>>
> >> >>>>> Till
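To make the proposed loss-function interface concrete, a squared loss might fill it in roughly like this. This is a sketch only, not the actual prototype: `Labeled` and `Array[Double]` stand in for `LabeledVector` and `WeightVector`, and the L2-style shrinkage in `updateWeightVector` is just one possible choice.

```scala
// Hypothetical squared-loss instance of the interface sketched above.
case class Labeled(features: Array[Double], label: Double)

object SquaredLoss {
  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // loss = 0.5 * (w·x - y)^2
  def loss(lv: Labeled, weights: Array[Double]): Double = {
    val residual = dot(lv.features, weights) - lv.label
    0.5 * residual * residual
  }

  // gradient = (w·x - y) * x
  def gradient(lv: Labeled, weights: Array[Double]): Array[Double] = {
    val residual = dot(lv.features, weights) - lv.label
    lv.features.map(_ * residual)
  }

  // One possible update: multiplicative L2-style shrinkage of the new weights.
  def updateWeightVector(oldW: Array[Double], newW: Array[Double],
                         regularizationValue: Double): Array[Double] =
    newW.map(_ * (1.0 - regularizationValue))
}
```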
> >> >>>>>
> >> >>>>>
> >> >>>>> On Tue, May 26, 2015 at 3:03 PM, Mikio Braun
> >> >>>>> <mikiobraun@googlemail.com>
> >> >>>>> wrote:
> >> >>>>>>
> >> >>>>>> I've been playing around with a few ideas, but you could have
> >> >>>>>> a helper RichMapFunction like
> >> >>>>>>
> >> >>>>>> class MapWithBroadcast[B, In, Out](name: String)(fct: (B) => (In) => Out)
> >> >>>>>>     extends RichMapFunction[In, Out]
> >> >>>>>> {
> >> >>>>>>   var f: (In) => Out = _
> >> >>>>>>
> >> >>>>>>   override def open(parameters: Configuration): Unit = {
> >> >>>>>>     f = fct(getRuntimeContext.getBroadcastVariable[B](name).get(0))
> >> >>>>>>   }
> >> >>>>>>
> >> >>>>>>   def map(in: In): Out = f(in)
> >> >>>>>> }
> >> >>>>>>
> >> >>>>>> which takes a function that receives the broadcast value and
> >> >>>>>> returns a function doing the actual map. You would then use it
> >> >>>>>> like this:
> >> >>>>>>
> >> >>>>>> def mapWithBroadcast[B, I, O: ClassTag : TypeInformation](
> >> >>>>>>     xs: DataSet[I], bc: DataSet[B])
> >> >>>>>>   (fct: (B) => (I) => O): DataSet[O] = {
> >> >>>>>>   xs.map(new MapWithBroadcast[B, I, O]("dummy")(fct))
> >> >>>>>>     .withBroadcastSet(bc, "dummy")
> >> >>>>>> }
> >> >>>>>>
> >> >>>>>> And then...
> >> >>>>>>
> >> >>>>>> val x = env.fromCollection(Seq(1, 2, 3, 4, 5, 6, 7))
> >> >>>>>> val sum = x.reduce(_ + _)
> >> >>>>>>
> >> >>>>>> val foo = mapWithBroadcast(x, sum)(sum => (x: Int) => x - sum)
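The curried shape of this helper can be tried out without Flink using plain collections. The following toy analogue (not the Flink API; `Seq` stands in for `DataSet`, and `mapWith` is a made-up name) shows the same two-stage pattern:

```scala
// Toy, Flink-free analogue of the curried mapWithBroadcast helper:
// the "broadcast" value is resolved once, then the returned closure
// maps each element. Seq stands in for DataSet.
def mapWith[B, I, O](xs: Seq[I], bc: B)(fct: B => I => O): Seq[O] = {
  val f = fct(bc) // corresponds to resolving the broadcast variable in open()
  xs.map(f)
}
```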
> >> >>>>>>
> >> >>>>>> Has something like this been considered at any point? Or are
> >> >>>>>> you all at the point where you can type in the broadcasting
> >> >>>>>> stuff while you sleep ;) ?
> >> >>>>>>
> >> >>>>>> -M
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Tue, May 26, 2015 at 1:14 PM, Mikio Braun
> >> >>>>>> <mikiobraun@googlemail.com>
> >> >>>>>> wrote:
> >> >>>>>> > Hi Theodore and Till,
> >> >>>>>> >
> >> >>>>>> > last Friday Till suggested I have a look at the gradient
> >> >>>>>> > code. Took me a while, but I think I got the main structure.
> >> >>>>>> >
> >> >>>>>> > I think I understand why the code is written how it is, but
> >> >>>>>> > just speaking from a purely data-science perspective, I find
> >> >>>>>> > that a lot of the code is super-heavy on boilerplate. The
> >> >>>>>> > LossCalculation and similar RichMapFunctions in
> >> >>>>>> > GradientDescent easily take up 20 lines of code just to
> >> >>>>>> > unpack a few broadcast variables and call a single function.
> >> >>>>>> > I understand why we need to use broadcast variables here,
> >> >>>>>> > and that this is a common idiom in Flink, but I wonder
> >> >>>>>> > whether one cannot reduce the boilerplate code.
> >> >>>>>> >
> >> >>>>>> > In my experience, it is pretty important when implementing
> >> >>>>>> > ML algorithms that the mapping from the code back to the
> >> >>>>>> > algorithm is still somewhat clear, because if you need to do
> >> >>>>>> > modifications or debugging, it's important that you still
> >> >>>>>> > know what corresponds to what; with so many lines of code
> >> >>>>>> > just moving data around, that's hardly the case.
> >> >>>>>> >
> >> >>>>>> > A general design decision I would like to question is the
> >> >>>>>> > tight integration of the specific loss functions,
> >> >>>>>> > regularizers, etc. into the GradientDescent code. Given that
> >> >>>>>> > it is really a pretty general approach applicable to many,
> >> >>>>>> > many different models, I think it's better to have the
> >> >>>>>> > stochastic gradient code be as agnostic as possible about
> >> >>>>>> > exactly what it is optimizing. Instead, you would pass it a
> >> >>>>>> > function which contains a method to compute a function value
> >> >>>>>> > and its derivative, and possibly some action to regularize
> >> >>>>>> > the parameters (if this cannot already be done with the
> >> >>>>>> > function itself).
> >> >>>>>> >
> >> >>>>>> > Then, such a function for learning linear models can be
> >> >>>>>> > configured in the same way as it is right now. I personally
> >> >>>>>> > would opt for optional parameters instead of the
> >> >>>>>> > builder-method style, but that is OK with me.
> >> >>>>>> >
> >> >>>>>> > This would in particular mean that I would try to eliminate
> >> >>>>>> > the case statements as much as possible.
> >> >>>>>> >
> >> >>>>>> > The result would then be a bunch of classes around the
> >> >>>>>> > gradient descent optimizer which very clearly just encode
> >> >>>>>> > that algorithm, and a bunch of classes around cost functions
> >> >>>>>> > for linear models which encode those, and both parts talk
> >> >>>>>> > with one another only through a small, clean interface.
> >> >>>>>> > Plus, this interface could be reused for other optimization
> >> >>>>>> > algorithms like BFGS, or the trust-region Newton method we
> >> >>>>>> > talked about last time.
> >> >>>>>> >
> >> >>>>>> > So my suggestions:
> >> >>>>>> >   - think of some way to reduce boilerplate code for
> >> >>>>>> >     broadcast variables
> >> >>>>>> >   - split the gradient descent code and the specific model
> >> >>>>>> >     for linear models to make the individual parts cleaner
> >> >>>>>> >     and more extensible
> >> >>>>>> >
> >> >>>>>> > Hope this helps,
> >> >>>>>> >
> >> >>>>>> > -M
> >> >>>>>> >
> >> >>>>>> >
> >> >>>>>> > --
> >> >>>>>> > Mikio Braun - http://blog.mikiobraun.de,
> >> >>>>>> > http://twitter.com/mikiobraun
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> --
> >> >>>>> Data Artisans GmbH | Tempelhofer Ufer 17 | 10963 Berlin
> >> >>>>>
> >> >>>>> info@data-artisans.com
> >> >>>>> phone +493055599146
> >> >>>>> mobile +4917671721261
> >> >>>>>
> >> >>>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B
> >> >>>>> Managing Directors: Kostas Tzoumas, Stephan Ewen
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >
> >
>
>
>
>
