hama-dev mailing list archives

From Thomas Jungblut <thomas.jungb...@gmail.com>
Subject Re: [ML] - data storage and basic design approach
Date Mon, 09 Jul 2012 16:13:56 GMT
For the matrix/vector I would propose my library interface (quite like
Mahout's math, but without boundary checks):
https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java

https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java
Full Writable for Vector and basic Writable for Matrix:
https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable

It is enough to implement all the machine learning algorithms I've seen so
far, and the builder pattern allows really nice chaining of calls to easily
code equations or translate code from MATLAB/Octave.
See, for example, the logistic regression cost function:
https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java
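To give a rough idea of what chaining buys you, here is a minimal sketch of
the logistic regression cost, which in Octave reads
J = -1/m * sum(y .* log(h) + (1 - y) .* log(1 - h)). The Vec class and its
methods (map, multiply, add, sum) are made up for this mail so that the
snippet compiles on its own; they are an assumption and not the actual
DoubleVector interface linked above.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical minimal dense vector, NOT the DoubleVector interface linked above. */
final class Vec {
    private final double[] v;

    Vec(double... values) {
        this.v = values.clone();
    }

    /** Applies f to every element and returns a new vector, so calls can be chained. */
    Vec map(DoubleUnaryOperator f) {
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = f.applyAsDouble(v[i]);
        }
        return new Vec(r);
    }

    /** Element-wise product; no boundary checks, in the spirit of the proposal. */
    Vec multiply(Vec o) {
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[i] * o.v[i];
        }
        return new Vec(r);
    }

    /** Element-wise sum; no boundary checks. */
    Vec add(Vec o) {
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[i] + o.v[i];
        }
        return new Vec(r);
    }

    double sum() {
        double s = 0.0;
        for (double x : v) {
            s += x;
        }
        return s;
    }

    int length() {
        return v.length;
    }
}

final class LogisticCost {
    /**
     * Mean cross-entropy cost for predictions h (each strictly between 0 and 1)
     * and labels y (each 0 or 1).
     */
    static double cost(Vec h, Vec y) {
        // y .* log(h)
        Vec positive = y.multiply(h.map(Math::log));
        // (1 - y) .* log(1 - h)
        Vec negative = y.map(t -> 1.0 - t).multiply(h.map(p -> Math.log(1.0 - p)));
        return -positive.add(negative).sum() / y.length();
    }
}
```

Each operation returns a new vector, so the Java lines map almost one to one
onto the Octave expression.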

For the interfaces of the algorithms:
I guess we need to get some more experience. I cannot tell what the
interfaces for them should look like, mainly because I don't know how the
BSP version of them will call the algorithm logic.
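Purely as a discussion starter, and not a settled design, the learn/predict
split from this thread could be sketched as two small interfaces. All names
below (Learner, Predictor, MeanBaseline) are hypothetical, plain double
arrays stand in for the vector types, and nothing here touches Hama or HDFS;
how a BSP job would call into these is exactly the open question.

```java
import java.util.List;

/** Hypothetical: turns training examples into a model of type M. */
interface Learner<M> {
    M learn(List<double[]> features, double[] outcomes);
}

/** Hypothetical: applies a previously learned model to a new input. */
interface Predictor<M> {
    double predict(M model, double[] features);
}

/** Toy implementation: the "model" is just the mean of the training outcomes. */
final class MeanBaseline implements Learner<Double>, Predictor<Double> {
    @Override
    public Double learn(List<double[]> features, double[] outcomes) {
        double sum = 0.0;
        for (double o : outcomes) {
            sum += o;
        }
        return sum / outcomes.length;
    }

    @Override
    public Double predict(Double model, double[] features) {
        // Ignores the input and always answers with the stored mean.
        return model;
    }
}
```

The idea would be that the model type M is whatever gets written to storage
after learning and read back before predicting.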

But having stable math interfaces is the key point.

2012/7/9 Tommaso Teofili <tommaso.teofili@gmail.com>

> Ok, so let's sketch up here what these interfaces should look like.
> Any proposal is more than welcome.
> Regards,
> Tommaso
>
> 2012/7/7 Thomas Jungblut <thomas.jungblut@gmail.com>
>
> > Looks fine to me.
> > The key is the interfaces for learning and predicting, so we should
> > define some vectors and matrices.
> > It would be enough to define the algorithms via the interfaces and a
> > generic BSP should just run them based on the given input.
> >
> > 2012/7/7 Tommaso Teofili <tommaso.teofili@gmail.com>
> >
> > > Hi all,
> > >
> > > in my spare time I started writing some basic BSP based machine learning
> > > algorithms for our ml module, now I'm wondering, from a design point of
> > > view, where it'd make sense to put the training data / model. I'd assume
> > > the obvious answer would be HDFS so this makes me think we should come up
> > > with (at least) two BSP jobs for each algorithm: one for learning and one
> > > for "predicting", each to be run separately.
> > > This would allow us to read the training data from HDFS, and consequently
> > > create a model (also on HDFS) and then the created model could be read
> > > (again from HDFS) in order to predict an output for a new input.
> > > Does that make sense?
> > > I'm just wondering what a general purpose design for Hama based ML stuff
> > > would look like so this is just to start the discussion, any opinion is
> > > welcome.
> > >
> > > Cheers,
> > > Tommaso
> > >
> >
>
