hama-dev mailing list archives

From Thomas Jungblut <thomas.jungb...@gmail.com>
Subject Re: [ML] - data storage and basic design approach
Date Tue, 10 Jul 2012 10:26:45 GMT
very nice, thank you very much

2012/7/10 Tommaso Teofili <tommaso.teofili@gmail.com>

> I've done the first import, we can start from that now, thanks Thomas.
> Tommaso
>
> 2012/7/10 Tommaso Teofili <tommaso.teofili@gmail.com>
>
> > ok, I'll try that, thanks :)
> > Tommaso
> >
> > 2012/7/10 Thomas Jungblut <thomas.jungblut@gmail.com>
> >
> >> I don't know if we need sparse/named vectors for the first scratch.
> >> You can just use the interface and the dense implementations and
> >> remove all the uncompilable code in the writables.
> >>
> >> 2012/7/10 Tommaso Teofili <tommaso.teofili@gmail.com>
> >>
> >> > Thomas, while inspecting the code I realized I may need to import
> >> > most/all of the classes inside your math library for the writables to
> >> > compile. Is that OK with you, or would you rather I didn't?
> >> > Regards,
> >> > Tommaso
> >> >
> >> > 2012/7/10 Thomas Jungblut <thomas.jungblut@gmail.com>
> >> >
> >> > > great, thank you for taking care of it ;)
> >> > >
> >> > > 2012/7/10 Tommaso Teofili <tommaso.teofili@gmail.com>
> >> > >
> >> > > > Ok, sure, I'll just add the writables along with DoubleMatrix/Vector,
> >> > > > with the AL2 headers on top.
> >> > > > Thanks Thomas for the contribution and feedback.
> >> > > > Tommaso
> >> > > >
> >> > > > 2012/7/10 Thomas Jungblut <thomas.jungblut@gmail.com>
> >> > > >
> >> > > > > Feel free to commit this, but take care to add the Apache license headers.
> >> > > > > Also I wanted to add a few test cases over the next few weekends.
> >> > > > >
> >> > > > > 2012/7/10 Tommaso Teofili <tommaso.teofili@gmail.com>
> >> > > > >
> >> > > > > > nice idea, quickly thinking about it, it looks to me that (C)GD is a
> >> > > > > > good fit for BSP.
> >> > > > > > Also I was trying to implement some easy meta learning algorithm like
> >> > > > > > the weighted majority algorithm, where each peer has its own learning
> >> > > > > > algorithm and gets penalized for each mistaken prediction.
> >> > > > > > Regarding your math library, do you plan to commit it yourself?
> >> > > > > > Otherwise I can do it.
> >> > > > > > Regards,
> >> > > > > > Tommaso
> >> > > > > >
> >> > > > > >
> >> > > > > > 2012/7/10 Thomas Jungblut <thomas.jungblut@gmail.com>
> >> > > > > >
> >> > > > > > > Maybe a good first step towards algorithms would be to try to evaluate
> >> > > > > > > how we can implement some non-linear optimizers in BSP (BFGS or the
> >> > > > > > > conjugate gradient method).
> >> > > > > > >
> >> > > > > > > 2012/7/9 Tommaso Teofili <tommaso.teofili@gmail.com>
> >> > > > > > >
> >> > > > > > > > 2012/7/9 Thomas Jungblut <thomas.jungblut@gmail.com>
> >> > > > > > > >
> >> > > > > > > > > For the matrix/vector I would propose my library interface (quite like
> >> > > > > > > > > Mahout's math, but without boundary checks):
> >> > > > > > > > >
> >> > > > > > > > > https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java
> >> > > > > > > > > https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java
> >> > > > > > > > >
> >> > > > > > > > > Full Writable for Vector and basic Writable for Matrix:
> >> > > > > > > > >
> >> > > > > > > > > https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable
> >> > > > > > > > >
> >> > > > > > > > > It is enough to implement all the machine learning algorithms I've seen
> >> > > > > > > > > until now, and the builder pattern allows really nice chaining of commands
> >> > > > > > > > > to easily code equations or translate code from MATLAB/Octave.
> >> > > > > > > > > See for example the logistic regression cost function:
> >> > > > > > > > >
> >> > > > > > > > > https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > very nice, +1!
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > For the interfaces of the algorithms:
> >> > > > > > > > > I guess we need to get some more experience. I cannot tell what the
> >> > > > > > > > > interfaces for them should look like, mainly because I don't know how
> >> > > > > > > > > the BSP version of them will call the algorithm logic.
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > > you're right, it's more reasonable to just proceed bottom-up with this,
> >> > > > > > > > as we're going to have a clearer idea while developing the different
> >> > > > > > > > algorithms.
> >> > > > > > > > So for now I'd introduce your library Writables and then proceed one
> >> > > > > > > > step at a time with the more common API.
> >> > > > > > > > Thanks,
> >> > > > > > > > Tommaso
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > But having stable math interfaces is the key point.
> >> > > > > > > > >
> >> > > > > > > > > 2012/7/9 Tommaso Teofili <tommaso.teofili@gmail.com>
> >> > > > > > > > >
> >> > > > > > > > > > Ok, so let's sketch out here what these interfaces should look like.
> >> > > > > > > > > > Any proposal is more than welcome.
> >> > > > > > > > > > Regards,
> >> > > > > > > > > > Tommaso
> >> > > > > > > > > >
> >> > > > > > > > > > 2012/7/7 Thomas Jungblut <thomas.jungblut@gmail.com>
> >> > > > > > > > > >
> >> > > > > > > > > > > Looks fine to me.
> >> > > > > > > > > > > The key is the interfaces for learning and predicting, so we should
> >> > > > > > > > > > > define some vectors and matrices.
> >> > > > > > > > > > > It would be enough to define the algorithms via the interfaces, and a
> >> > > > > > > > > > > generic BSP should just run them based on the given input.
> >> > > > > > > > > > >
> >> > > > > > > > > > > 2012/7/7 Tommaso Teofili <tommaso.teofili@gmail.com>
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Hi all,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > in my spare time I started writing some basic BSP based machine learning
> >> > > > > > > > > > > > algorithms for our ml module. Now I'm wondering, from a design point of
> >> > > > > > > > > > > > view, where it'd make sense to put the training data / model. I'd assume
> >> > > > > > > > > > > > the obvious answer would be HDFS, so this makes me think we should come up
> >> > > > > > > > > > > > with (at least) two BSP jobs for each algorithm: one for learning and one
> >> > > > > > > > > > > > for "predicting", each to be run separately.
> >> > > > > > > > > > > > This would allow us to read the training data from HDFS and consequently
> >> > > > > > > > > > > > create a model (also on HDFS); the created model could then be read
> >> > > > > > > > > > > > (again from HDFS) in order to predict an output for a new input.
> >> > > > > > > > > > > > Does that make sense?
> >> > > > > > > > > > > > I'm just wondering what a general purpose design for Hama based ML stuff
> >> > > > > > > > > > > > would look like, so this is just to start the discussion; any opinion is
> >> > > > > > > > > > > > welcome.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Cheers,
> >> > > > > > > > > > > > Tommaso
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
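
The weighted majority idea mentioned in the thread (each peer runs its own learner and is penalized for each mistaken prediction) can be sketched in plain Java as below. This is only an illustration under stated assumptions, not code from Hama or from the linked math library: the class name WeightedMajority, the penalty factor beta, the restriction to binary labels 0/1, and the tie-breaking towards label 1 are all hypothetical choices. How the peers would exchange predictions over BSP is left out.

```java
import java.util.Arrays;

/** Minimal weighted majority sketch for binary labels (0 or 1). Hypothetical, not Hama API. */
final class WeightedMajority {
    private final double[] weights; // one weight per expert (for example, one per BSP peer)
    private final double beta;      // assumed penalty factor, strictly between 0 and 1

    WeightedMajority(int numExperts, double beta) {
        if (numExperts <= 0 || beta <= 0.0 || beta >= 1.0) {
            throw new IllegalArgumentException("need numExperts > 0 and 0 < beta < 1");
        }
        this.weights = new double[numExperts];
        Arrays.fill(this.weights, 1.0); // every expert starts with the same trust
        this.beta = beta;
    }

    /** Combined prediction: the label backed by the larger total weight (ties go to 1). */
    int predict(int[] expertPredictions) {
        double weightForOne = 0.0;
        double weightForZero = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (expertPredictions[i] == 1) {
                weightForOne += weights[i];
            } else {
                weightForZero += weights[i];
            }
        }
        return weightForOne >= weightForZero ? 1 : 0;
    }

    /** Penalize every expert whose prediction differs from the true label. */
    void update(int[] expertPredictions, int trueLabel) {
        for (int i = 0; i < weights.length; i++) {
            if (expertPredictions[i] != trueLabel) {
                weights[i] *= beta;
            }
        }
    }

    /** Defensive copy of the current expert weights. */
    double[] weights() {
        return weights.clone();
    }
}
```

In a BSP setting, one plausible arrangement would be for each peer to send its local prediction to a master peer in one superstep and for the master to call predict and update in the next, but that mapping is an assumption rather than something the thread settles.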
