commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gilles Sadowski <gil...@harfang.homelinux.org>
Subject Re: [Math] Old to new API ("MultivariateDifferentiable(Vector)Function")
Date Sat, 01 Dec 2012 22:17:35 GMT
Hello.

> > 
> >> 
> >> Now that I have some time, let me try to make my case clearly. First I want
to say that this is not some attack on the automatic-differentation package. I love automatic-differentation
and symbolic packages. I personally cannot compute a derivative without a computer for the
life of me. And I am also glad we are finally starting to agree on something :)
> > 
> > I don't think that we disagreed about the fact that there was a problem with
> > the interface to the user application. [_I_ started this thread hinting that
> > there was a problem (at least for me as a user, based on my use-case).]
> > 
> > [Then there were several misunderstandings about what was the problem and how
> > to solve it...]
> > 
> >> 
> >> This discussion is about figuring out how an incorrect framework and storage
affects the performance of an optimization. That is why I am so worried.
> > 
> > I can understand that this is an important point for your use-case.
> > There are now two vantage points (at least) on two aspects to consider: You
> > seem to have experience where storage matter (while I don't); you worry
> > about "small" objects allocation (while I don't, anymore).
> > 
> > I'm not saying that your worries do not count; quite the contrary. CM is not
> > my "toolbox"; it's a library based on the knowledge of its developers
> > (obviously).
> > If you bring actual use-cases (i.e. unit tests or real application code
> > that can be benchmarked) that will show worrying behaviour, it will
> > certainly trigger swift corrective action. [This has happened recently.]
> > 
> > In this area (performance), the problem is that intuition (or educated
> > guess, however you want to name it) is not a substitute for actual runs,
> > sometimes by a large amount. [When I started to use CM, I raised the issue
> > that a 3D vector was immutable (so that I had to create a new one whenever
> > its coordinates were changing). Surely this was a performance hit! Actually
> > it wasn't.]
> 
> This is not the same case. You take a function that a user normally expresses as a two
dimensional array or a vector, you force them to allocate 10K+ new objects that you then have
to unwrap back into the structure the user would have happily supplied you in the first place.
The second issue you missed is one of scale. The difference between modifying a 3 variable
array and creating a copy is not large. Try doing this with a 10k+ vector, where you don't
actually need to modify any of the entries but are just doing copies for the hell of it. This
is a known critical component of an optimization and should be optimized for performance itself.

It's always not the same case (really). The crux of the example was about
(my) misconception about what takes time in a JVM.
In CM, there are tons of unneeded (from your point-of-view) copying because
in most cases, it was deemed worth in order to ensure the integrity of the
subsequent processing.
And we also noticed that in the "usual" (the developers' use-cases), it did
not matter much. [And where it does, then we try to take it into account.]

> 
> > 
> > Again, that does not mean that you are wrong in this case. It's just that we
> > cannot jump and change the code every time someone comes up with what
> > amounts to "conventional wisdom". If the person comes with a patch that
> > readily proves the point (at least marginally through microbenchmarking),
> > then we all benefit.
> > 
> 
> This is not "conventional wisdom", this is 60 years of research, so you better provide
a good reason for you increased level of abstraction. I asked you why you would indirectly
wrap a Hessian or a Jacobian in a much heaver class that provides no useful features for a
newton optimizer. I don't think you gave me an answer. I do not believe that good OO design
is to have as many abstract layers as you can think of. Good OO design is just like any good
engineering design, its about having things clean, simple, and not be dependent on components
that are not required. In addition, if you are working on a code that is called thousands
of times, you should think really careful about memory performance.

I only talked about code reuse (internally). This should not have impacted
users (i.e. require them to learn about "DerivativeStructure"). Luc already
said that _this_ (what I say in the provious sentence) is was not necessary.
We are not trying to get one step further. Please forget about
"DerivativeStructure".

> 
> >> 
> >> So lets start with the basics. A Newton method must compute a descent step by
solving a linear equation
> >> 
> >> H*p = -g, (1)
> >> 
> >> where H is the Hessian, g is the gradient, and p is the desired descent step.
What I am about to say also holds for non-linear least squares method, where Hessian is approximated
using the Jacobian as
> >> 
> >> H \approx J^T*J+\lambda I.
> >> 
> >> Now, if you are not solving a simple optimization problem that you keep giving
examples for, you can easily have a Hessian be a very large matrix,
> >> like 1000x1000 or larger. Now, you better hope that you are storing H using
your linear algebra framework, otherwise eq. (1) computation is going to take a while.
> >> This is actually a very active area of research, and that is why having sparse
linear algebra (aren't you removing this? ;) ) and iterative solvers is important to a lot
of people.
> > 
> > Yes we are removing it because it is buggy and _nobody_ stepped up to say
> > that it was important for CM users, and to help solve the problems in a way
> > consistent with real-life usage of such a feature.
> > As S├ębastien said, you are warmly welcome to help us find a way to keep the
> > feature.
> 
> I personally do not use sparse linear algebra. I was just pointing out how important
computation of eq. 1 is in general. I wish I had the time to help :(
> 
> > 
> >> 
> >> What you are proposing as good OO style 
> > 
> > This discussion has really nothing to do with "OO style", merely code reuse;
> > and it was my big misunderstanding (partly because of my lack of knowledge,
> > partly because of the lack of documentation on the targetted usages of
> > "DerivativeStructure", which IIUC now, are outside CM) that I believed
> > that the new "MultivariateDifferentiableFunction" was the way to go.
> > 
> > [Also, Luc had been moving very fast on this. And I couldn't keep up
> > although I had wondered earlier why this had to impact usage in the
> > "optimization" package.]
> > 
> >> is that if someone has a function that they want to optimize, they need to take
what is probably already a RealMatrix or [][], and create 1000x1000 DerivativeStructure objects,
that,
> > 
> > IIUC, that would be 1000 "DerivativeStructure" (not 1000x1000). If so, for
> > this example, using "DerivativeStructure" would lead to about a 4% increase
> > in storage memory.
> 
> Ok 1000. I guess it's really hard to understand this DerivativeStructure without further
documentation. You can change my problem to least-squares problem with a very large number
of observables. Same problem will pop-up again.

Maybe, maybe not.
I think that it's better to drop the issue of performance. Without figures,
it's useless. [See, for example, issue MATH-740 where we have figures, yet
nobody stepped up to improve the situation.]

> 
> > 
> >> within the next 10 lines in the optimizer, are going to be converted back to
a RealMatrix? Not only have you just fragmented your heap space, but your garbage collector
> >> is going crazy, and you are wasting an incredible amount of memory.
> > 
> > This is the kind of assertions that really needs support from actual code.
> > [Again, I don't claim to know better; I think that it would really be useful
> > to accumulate a list of "do and don't" for Java and CM, in the form of
> > reproducible user experience.]
> > 
> >> Now imagine if your Jacobian is actually very simple to compute but large, then
this whole conversion back and forth actually takes much longer than the function evaluation.
> > 
> > We are willing to take this into account, but since I cannot see the
> > behaviour in my use-case (where the evaluation time overwhelms, by several
> > orders of magnitude, all such considerations), I do not have any incentive
> > to change what seems to be "efficient enough" (for the actually known
> > cases).
> > Again, your experience with other use-cases would be very valuable to
> > analyze the efficiency of the CM implementations and improve them, if
> > necessary.
> > 
> >> 
> >> So, why exactly are you insisting on taking this performance penalty?
> > 
> > It's the other way around. Until the performance penalty is proven, we've
> > decided that it's up to the developer willing to do the work to decide.
> > Again, if there is patch, it makes the matter much easier to decide on.
> > 
> > I admit that we decided to change the internal working of the optimizers, so
> > _we_ should have proved that it does not impact usability. Hence, again, the
> > usefulness of having a test suite of "real" applications which could be
> > benchmarked regularly in order to have performance regressions induced by
> > otherwise welcome changes.
> > [I raised that issue several times on the ML.]
> 
> It's not about usability. When you tests your components you should not
> test on small problems, because any bad software will work on that.
> Create a case where you have 10K+ observations and 10K+ variable and see
> how your change effects the solution. I wish I would have time to do it
> myself, but I don't.

Ditto.

> As a user I can tell you that scalability is an important issue.

I agree. So what?

> 
> > 
> >> You claim that the optimizer can work better if it gets a DerivativeStructure,
why?
> > 
> > This was (supposed to be) an improvement of the _internal_ design. [Several
> > weeks ago, Luc posted the problems he was encountering.]
> > 
> >> Where is the fact that you are holding a DerivativeStructure come into effect
for a Newton method? Can you provide any literature in that regard? The classical optimization
bottleneck, if not the actual function evaluation, is eq. (1). The optimization methodology
is build on top of years of research in computational linear algebra concepts, and does not
care how the matrices are actually computed. Efficient memory usage and speed is critical
here because in Newton optimizations eq. (1) is evaluated thousands of times. The goal of
the Jacobian is not to store derivatives it is to store a matrix of numbers that allows you
to quadratically approximate your target function.
> > 
> > OK. Then Luc's proposal seems appropriate.
> > There would be new interfaces defining the "optimize" method appropriately.
> > For algorithms that need the gradient, it must be provided in an additional
> > argument, of type "MultivariateVectorFunction"; for those that need, the
> > Jacobian, the argument would be of type "MultivariateMatrixFunction".
> > Do you agree?
> > 
> 
> I keep saying that there are cases when evaluating value gradient Hessian is faster together
than as a separate function. So no, I do not agree. I do agree it is better than having DerivativeStructure,
but I think it is worse than what you had in 3.0. The proposal is pretty much like before
in 3.0, but now I need to create two classes every time I want to optimize a function. Why
are you doing this? I don't understand why this change needs to happen. 
> 
> I don't see a problem at all. All that has to happen is that the function in gradient
based methods overwrites the optimization function, such that it now takes a subclass of the
function that is used in non-gradient method. That sub-class would require a gradient function
to be implemented.

Sorry, I'm confused.
Could you please spell out _in code_ what you want implemented, or just tell
what classes are good as they are and which changes are bad, in the current
developement version?

> 
> >> 
> >> You guys are focusing on people using finite differences and trying to optimize
a newton method to use finite differences. This is not what the main purpose of a newton method
is. If you want a method that dynamically adjusts finite differences step size, you should
switch to BOBYQA, which was designed for this case.
> > 
> > Another misunderstanding probably. [The "stepSize" discussion was sort of
> > a digression.]
> > 
> >> 
> >> Derivatives can be computed by a computer using a symbolic toolbox a priori
(something that I was referring to when I accidentally said auto-differenation). They can
also be efficiently simplified by that toolbox before being pasted back into you program.
Auto-diff could also be an important tool for people if their problem is simple or they are
not concerned with optimal efficiency . This can easily be handled by a wrapper and has nothing
to do with Newton optimization.
> > 
> > Maybe we should also change the "NewtonSolver" (in package
> > "o.a.c.m.analysis.solvers"). [In the same way we'll do for the optimizer
> > with derivatives. Actually those two packages were improved in parallel so
> > that the interface would be very similar from a usage point of view.]
> > 
> 
> Not sure what you mean.
> 
> >> If you want to talk about proper OO design and internal simplification then
you should focus on being able to pass a linear solver into your optimization method. This
way you allow people to use iterative and sparse solvers when needed.
> > 
> > This is a new feature; contributions are welcome. [CM is primarily
> > developed by people who use (part of) it; if they don't need that feature,
> > or don't know how to implement it, or do not have the time to implement it,
> > it won't be implemented.]
> > 
> >> If you are concerned about people getting derivatives wrong, you can do what
MATLAB does, which is to add an option to check the derivatives using finite differences when
debugging. 
> > 
> > IMHO, the top-priority feature to help in debugging is to introduce logging
> > statements!
> 
> I would go further than that. I think the user should be able to get back all the iterations
steps back, and the associated rmsd values, etc if so desired. But that is sort of a wish.

[I don't agree, but this is another discussion.
With logging, you'd be able to get this information without cluttering CM
with data structure for saving transient states.]

> > 
> >> 
> >> This brings me to my second issue. There are important problems where computation
> >> of the derivatives combined with the actual function evaluation is *significantly*
> >> faster than computing each alone. I am not clear why I am hitting such resistance
> >> with this. [...]
> > 
> > [If you think that you were met with strong resistance, I'd suggest that you
> > look in the archives for the discussions about exceptions in CM...]
> > 
> > Unless I'm mistaken, CM does not prevent you to write a class that takes
> > advantage of combining the computations.
> 
> How would I write that class? You optimizer first requests function value , then separately
the Jacobian. It is not done at the same time. I could try to cache intermediately results,
see if you are requesting a Jacobian for the same values you called my value function for,
but that is very dirty.

In text, that looks dirty. ;-)
Please provide the code snippets which I request above, and I surely hope
that we'll meet at some point. Let me remind you that I probably need the
same thing as you! :-)

> 
> > 
> >> What I suggested as function interfaces was just an initial quick suggestion
> >> that I would be happy for anyone to change in a way that everyone feels positive
> >> about. I just don't want my second issue to be ignored like a non-issue.
> > 
> > I'm getting confused. Could you please restate what where those first and
> > second issue?
> 
> First issue is you are creating extra layers of conversion for the user that is
> bad for performance,

OK. Thanks. That is a misunderstanding. It's certainly not the goal of those
recent changes. And if there is a performance hit in the upcoming CM 3.1, I
can assure you that it will corrected in 3.2.

> and there is not a single specific reason given for why its a good idea.

...

> Second issue is that computing value gradient Hessian might be faster for
> some functions together rather than in two unconnected calls.

OK. Good point. Was it possible before (in version 3.0)? How did you do it?
(cf. code snippet request). Then we'll make sure it will work in 3.1 too
(albeit with small syntax changes if needed for our internal cleanup of the
package). How does that sound?


Best regards,
Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message