commons-dev mailing list archives

From luc.maison...@free.fr
Subject Re: [math] Questions about the linear package
Date Wed, 14 Oct 2009 10:01:41 GMT

----- "Jake Mannix" <jake.mannix@gmail.com> wrote:

> Greetings, commons-math!
>
>   I've been looking at a variety of apache/bsd-licensed linear libraries for
> use in massively parallel machine-learning applications I've been working on
> (I am housing my own open-source library at http://decomposer.googlecode.com,
> and am looking at integrating with/using/contributing to Apache Mahout), and
> I'm wondering a little about the linear API there is here in commons-math:
>
>   * also for RealVector - No iterator methods?  So if the implementation is
> sparse, there's no way to just iterate over the non-zero entries?  What's
> worse, you can't even subclass OpenMapVector and expose the iterator on the
> OpenIntToDoubleHashMap inner object, because it's private. :\

Good idea. You can use JIRA <https://issues.apache.org/jira/browse/MATH> to register
a request for implementing this. Patches are of course welcome.
There should probably be two iterators: one for all entries and one for the non-default entries
(which may be non-zeroes or non-NaN or anything else).
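
To make the two-iterator idea concrete, here is a minimal sketch. All names in it
(VectorEntry, SketchSparseVector, nonDefaultIterator) are hypothetical and chosen only for
illustration; they are not existing [math] classes, and a real implementation would sit on
top of the existing hash map rather than a TreeMap.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

/** Hypothetical (index, value) pair returned by both iterators. */
final class VectorEntry {
    final int index;
    final double value;

    VectorEntry(int index, double value) {
        this.index = index;
        this.value = value;
    }
}

/** Hypothetical sparse vector; a stand-in, not a commons-math class. */
final class SketchSparseVector {
    private final int dimension;
    private final double defaultValue;
    // Only entries that differ from defaultValue are stored, sorted by index.
    private final TreeMap<Integer, Double> stored = new TreeMap<Integer, Double>();

    SketchSparseVector(int dimension, double defaultValue) {
        this.dimension = dimension;
        this.defaultValue = defaultValue;
    }

    void setEntry(int index, double value) {
        if (index < 0 || index >= dimension) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Double.compare treats NaN as equal to NaN, so a NaN default works too.
        if (Double.compare(value, defaultValue) == 0) {
            stored.remove(index);
        } else {
            stored.put(index, value);
        }
    }

    double getEntry(int index) {
        Double v = stored.get(index);
        return v == null ? defaultValue : v.doubleValue();
    }

    /** Iterates over all entries, default or not, in index order. */
    Iterator<VectorEntry> iterator() {
        return new Iterator<VectorEntry>() {
            private int next = 0;

            public boolean hasNext() {
                return next < dimension;
            }

            public VectorEntry next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                VectorEntry e = new VectorEntry(next, getEntry(next));
                next++;
                return e;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    /** Iterates over the stored (non-default) entries only, in index order. */
    Iterator<VectorEntry> nonDefaultIterator() {
        final Iterator<Map.Entry<Integer, Double>> it = stored.entrySet().iterator();
        return new Iterator<VectorEntry>() {
            public boolean hasNext() {
                return it.hasNext();
            }

            public VectorEntry next() {
                Map.Entry<Integer, Double> e = it.next();
                return new VectorEntry(e.getKey().intValue(), e.getValue().doubleValue());
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

With a vector of dimension 5 holding two non-default values, the first iterator visits five
entries and the second only two.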

> 
>   * for RealVector - what's with the million-different methods mapXXX(),
> mapXXXtoSelf()?  Why not just map(UnaryFunction()), and
> mapToSelf(UnaryFunction()), where UnaryFunction defines the single method
> double apply(double d); ?  Any user who wishes to implement RealVector (to
> say, make a more efficient specialized SparseVector) has to go through the
> pain of writing up a million methods dealing with these (and even if
> copy/paste gets most of this,  it still leads to some horribly huge .java
> files filled with junk that does not appear to be used).  There does not
> even appear to be an AbstractRealVector which implements all of these for
> you (by using the above-mentioned iterator() ).

This API is set up the way I got it from an external contributor, so I guess he had a use
case for that. I extended it in the same spirit and ended up with this huge mess. I'm sorry
for that. I agree a more generic method would be interesting. Removing these methods would,
however, introduce an incompatible API change, so this could be done only in a major release
(i.e. 3.0), which is probably a long time from now.

The generic method should also either be provided in two versions (all entries and non-default
entries) or it should have an iterator argument. For example the cosine and exponential functions
transform a zero entry into a non-zero entry so they cannot ignore zero entries.
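
A rough sketch of the "two versions" option follows. UnaryFunction is the interface from
your proposal; everything else (MappableSparseVector, mapNonZeroToSelf, Doubling, Cosine) is
a hypothetical name used only for illustration, and zero is assumed to be the default value.

```java
import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;

/** The single-method interface suggested in the quoted proposal. */
interface UnaryFunction {
    double apply(double d);
}

/** Maps 0 to 0, so skipping implicit zeros is safe. */
final class Doubling implements UnaryFunction {
    public double apply(double d) {
        return 2.0 * d;
    }
}

/** Maps 0 to 1, so implicit zeros must NOT be skipped. */
final class Cosine implements UnaryFunction {
    public double apply(double d) {
        return Math.cos(d);
    }
}

/** Hypothetical sparse vector with zero as the implicit default value. */
final class MappableSparseVector {
    private final int dimension;
    private final Map<Integer, Double> nonZero = new TreeMap<Integer, Double>();

    MappableSparseVector(int dimension) {
        this.dimension = dimension;
    }

    void setEntry(int index, double value) {
        if (index < 0 || index >= dimension) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        if (value == 0.0) {
            nonZero.remove(index);
        } else {
            nonZero.put(index, value);
        }
    }

    double getEntry(int index) {
        Double v = nonZero.get(index);
        return v == null ? 0.0 : v.doubleValue();
    }

    int storedCount() {
        return nonZero.size();
    }

    /** Applies f to every entry, including the implicit zeros. */
    void mapToSelf(UnaryFunction f) {
        for (int i = 0; i < dimension; i++) {
            setEntry(i, f.apply(getEntry(i)));
        }
    }

    /** Applies f to stored entries only; correct only when f(0) == 0. */
    void mapNonZeroToSelf(UnaryFunction f) {
        // Copy the keys first: setEntry may remove entries while we loop.
        for (Integer i : new ArrayList<Integer>(nonZero.keySet())) {
            setEntry(i.intValue(), f.apply(nonZero.get(i).doubleValue()));
        }
    }
}
```

Doubling can safely go through mapNonZeroToSelf and keeps the vector sparse, whereas Cosine
has to go through mapToSelf and fills in every entry.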

> 
>   * while we're at it, if there is map(), why not also double
> RealVector.collect(Collector()), where Collector defines void collect(int
> index, double value); and double result(); - this can be used for generic
> inner products and kernels (and can allow for consolidating all of the
> L1Norm(), norm(), and LInfNorm() methods into this same method, passing in
> different L1NormCollector() etc... instances).

Good idea too. Another JIRA ticket for that?
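
As a starting point for such a ticket, the proposal could be sketched as below. Collector
and L1NormCollector are the names from your message; L2NormCollector, LInfNormCollector,
DotProductCollector and the static collect() helper over a plain array are assumptions made
only to keep the example self-contained.

```java
/** The two-method interface described in the quoted proposal. */
interface Collector {
    void collect(int index, double value);

    double result();
}

/** Sum of absolute values. */
final class L1NormCollector implements Collector {
    private double sum;

    public void collect(int index, double value) {
        sum += Math.abs(value);
    }

    public double result() {
        return sum;
    }
}

/** Euclidean norm. */
final class L2NormCollector implements Collector {
    private double sumOfSquares;

    public void collect(int index, double value) {
        sumOfSquares += value * value;
    }

    public double result() {
        return Math.sqrt(sumOfSquares);
    }
}

/** Largest absolute value. */
final class LInfNormCollector implements Collector {
    private double max;

    public void collect(int index, double value) {
        max = Math.max(max, Math.abs(value));
    }

    public double result() {
        return max;
    }
}

/** Inner product with a fixed second operand; uses the index argument. */
final class DotProductCollector implements Collector {
    private final double[] other;
    private double sum;

    DotProductCollector(double[] other) {
        this.other = other;
    }

    public void collect(int index, double value) {
        sum += value * other[index];
    }

    public double result() {
        return sum;
    }
}

final class CollectSketch {
    /** Stand-in for the proposed RealVector.collect(), here over a dense array. */
    static double collect(double[] data, Collector collector) {
        for (int i = 0; i < data.length; i++) {
            collector.collect(i, data[i]);
        }
        return collector.result();
    }
}
```

For all four collectors a zero entry contributes nothing, so a sparse implementation could
feed them from the non-default entries only.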

> 
>   * why all the methods which are overloaded to take either RealVector or
> double[] (getDistance, dotProduct, add, etc...) - is there really that much
> overhead in just implementing dotProduct(double[] d)  as just
> dotProduct(new ArrayRealVector(d, false)); - no copy is done, nothing is
> done but one object creation...

It's not the copy that could take time, but the iteration which needs to call getEntry().
So yes, there is some overhead and it can be avoided by providing the simple array version.
Of course, a default implementation that wraps the array into an ArrayRealVector can be added
to the AbstractRealVector class you proposed above, in order to simplify new implementations.
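
A minimal sketch of that arrangement, with hypothetical Sketch* classes standing in for the
proposed AbstractRealVector and for ArrayRealVector (only dotProduct is shown): the abstract
class supplies the wrapping default, and the array-backed class overrides it to read both
arrays directly.

```java
/** Hypothetical base class; the real AbstractRealVector is only a proposal here. */
abstract class SketchAbstractVector {
    abstract int getDimension();

    abstract double getEntry(int index);

    /** Generic version: pays one getEntry() call per operand per index. */
    double dotProduct(SketchAbstractVector v) {
        if (v.getDimension() != getDimension()) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < getDimension(); i++) {
            sum += getEntry(i) * v.getEntry(i);
        }
        return sum;
    }

    /** Default array version: wrap without copying, then delegate. */
    double dotProduct(double[] d) {
        return dotProduct(new SketchArrayVector(d, false));
    }
}

/** Array-backed stand-in that overrides the default with direct array access. */
final class SketchArrayVector extends SketchAbstractVector {
    private final double[] data;

    SketchArrayVector(double[] d, boolean copyArray) {
        this.data = copyArray ? d.clone() : d;
    }

    int getDimension() {
        return data.length;
    }

    double getEntry(int index) {
        return data[index];
    }

    @Override
    double dotProduct(double[] d) {
        if (d.length != data.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i] * d[i]; // no getEntry() calls at all
        }
        return sum;
    }
}

/** A new implementation only has to provide the two abstract methods. */
final class SketchConstantVector extends SketchAbstractVector {
    private final int dimension;
    private final double value;

    SketchConstantVector(int dimension, double value) {
        this.dimension = dimension;
        this.value = value;
    }

    int getDimension() {
        return dimension;
    }

    double getEntry(int index) {
        return value;
    }
}
```

SketchConstantVector gets dotProduct(double[]) for free, while SketchArrayVector keeps the
faster specialized version.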

> 
>   * SparseVector is just a marker interface?  Does it serve any purpose?

For now, yes, it is a marker interface. There was some discussion about these interfaces just
before the release of 2.0. The conclusion was that they should remain simple markers at that
time.

> 
> I guess I could ask similar questions on the Matrix interfaces, but maybe
> those will be cleared up by understanding the philosophy behind the
> Vector interfaces.
>
> I'd love to use commons-math for parts of my projects in which the entire
> data sets can live in memory (often part of the computation falls into this
> category, even if it's not the most meaty part, it's big enough that I'll
> kill my performance if I am stuck writing my own subroutines for eigen
> computation, etc for many moderately small matrices), but converting to and
> from the commons-math linear interfaces seems a bit unwieldy.  Maybe it
> would be easier if I could understand why these are the way they are.

The idea was really that people could provide their own implementations. Some methods that
are close in spirit to the iterators you ask for are in the matrix interfaces (the walkXxx
methods) and are used in many algorithms inside [math].
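
The pattern behind those walk methods is a visitor that is handed each entry in turn. The
sketch below is a simplified stand-in over a plain double[][]; EntryVisitor, MaxAbsVisitor
and WalkSketch are hypothetical names, and the actual visitor interfaces in [math] have
their own signatures, which are not reproduced here.

```java
/** Simplified, hypothetical visitor: called once per matrix entry. */
interface EntryVisitor {
    void visit(int row, int column, double value);
}

/** Example visitor that tracks the largest absolute value seen. */
final class MaxAbsVisitor implements EntryVisitor {
    private double max;

    public void visit(int row, int column, double value) {
        max = Math.max(max, Math.abs(value));
    }

    double result() {
        return max;
    }
}

final class WalkSketch {
    /** Visits every entry row by row; the matrix decides the traversal order. */
    static void walkInRowOrder(double[][] matrix, EntryVisitor visitor) {
        for (int r = 0; r < matrix.length; r++) {
            for (int c = 0; c < matrix[r].length; c++) {
                visitor.visit(r, c, matrix[r][c]);
            }
        }
    }
}
```

Because the matrix drives the loop, an implementation is free to pick the order that suits
its storage, which is the same freedom a non-default iterator would give a sparse vector.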

> 
> I'm happy to contribute patches consolidating interfaces and/or extending

Fine. We are always happy to see a community growing around our components.

> functionality (you seem to be missing a compact int/double pair
> implementation of sparse vectors, for example, which are a fantastically
> performant format if they're immutable and only being used for dot products
> and adding them to dense vectors), if it would be of help (I'm tracking my
> attempts at this over on my GitHub clone of trunk:
> http://github.com/jakemannix/commons-math ).

If you intend to contribute them to [math], you'll have to put them on JIRA and send a Software
Grant <http://www.apache.org/licenses/#grants> to the Apache secretary. If you develop contributions
directly for [math] (i.e. if it is not preexisting software), then rather than a Software
Grant we will need a Contributor License Agreement (CLA), either an Individual CLA
or a Corporate CLA <http://www.apache.org/licenses/#clas>.

Thanks
Luc

> 
>   -jake mannix
>   Principal Software Engineer
>   Search and Recommender Systems
>   LinkedIn.com


