From: Tommaso Teofili
Date: Wed, 11 Jul 2012 13:28:55 +0200
Subject: Re: [ML] - data storage and basic design approach
To: dev@hama.apache.org

Maybe for Miklai it'd be good to just keep his math/matrix classes; once he's finished we could eventually merge them together into a dedicated math module.

My 2 cents,
Tommaso

2012/7/10 Thomas Jungblut:
> I have told him that he could use it; he uses a different approach.
> You said that we can merge later when he is ready.
> First come, first served.

2012/7/10 Edward J. Yoon:
> My concern is that this looks like duplicated effort with Miklai.
> I think it needs to be organized.
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon

On Tue, Jul 10, 2012 at 8:26 PM, Thomas Jungblut wrote:
> Splitting out a math module would be smarter, but let's just keep that in the ML package.
>
> Anyone volunteer to code a simple (mini-)batch gradient descent in BSP?
> http://holehouse.org/mlclass/17_Large_Scale_Machine_Learning.html

2012/7/10 Edward J. Yoon:
> I would like to move it to the core module so that others can reuse it.

On Tue, Jul 10, 2012 at 7:13 PM, Tommaso Teofili wrote:
> I've done the first import, we can start from that now. Thanks Thomas.
> Tommaso

2012/7/10 Tommaso Teofili:
> ok, I'll try that, thanks :)
> Tommaso

2012/7/10 Thomas Jungblut:
> I don't know if we need sparse/named vectors for the first scratch.
> You can just use the interface and the dense implementations, and remove all the uncompilable code in the writables.

2012/7/10 Tommaso Teofili:
> Thomas, while inspecting the code I realized I may need to import most/all of the classes inside your math library for the writables to compile. Is that ok for you, or would you rather not?
> Regards,
> Tommaso

2012/7/10 Thomas Jungblut:
> great, thank you for taking care of it ;)

2012/7/10 Tommaso Teofili:
> Ok, sure, I'll just add the writables along with DoubleMatrix/Vector with the AL2 headers on top.
> Thanks Thomas for the contribution and feedback.
> Tommaso

2012/7/10 Thomas Jungblut:
> Feel free to commit this, but take care to add the Apache license headers.
> Also, I wanted to add a few testcases over the next few weekends.

2012/7/10 Tommaso Teofili:
> nice idea; thinking about it quickly, it looks to me that (C)GD is a good fit for BSP.
> Also, I was trying to implement some easy meta-learning algorithm like the weighted majority algorithm, where each peer has its own learning algorithm and gets penalized for each mistaken prediction.
> Regarding your math library, do you plan to commit it yourself? Otherwise I can do it.
> Regards,
> Tommaso

2012/7/10 Thomas Jungblut:
> Maybe a first good step towards algorithms would be to try to evaluate how we can implement some non-linear optimizers in BSP (BFGS or the conjugate gradient method).

2012/7/9 Tommaso Teofili:
> very nice, +1!
> You're right, it's more reasonable to just proceed bottom-up with this, as we're going to have a clearer idea while developing the different algorithms.
> So for now I'd introduce your library Writables and then proceed one step at a time with the more common API.
> Thanks,
> Tommaso

2012/7/9 Thomas Jungblut:
> For the matrix/vector I would propose my library interface (quite like Mahout's math, but without boundary checks):
>
> https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java
> https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java
>
> Full Writable for Vector and basic Writable for Matrix:
>
> https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable
>
> It is enough to implement all the machine learning algorithms I've seen until now, and the builder pattern allows really nice chaining of commands to easily code equations or translate code from Matlab/Octave. See for example the logistic regression cost function:
>
> https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java
>
> For the interfaces of the algorithms: I guess we need to get some more experience. I cannot tell how the interfaces for them should look, mainly because I don't know how the BSP version of them will call the algorithm logic.
> But having stable math interfaces is the key point.

2012/7/9 Tommaso Teofili <tommaso.teofili@gmail.com>:
> Ok, so let's sketch up here what these interfaces should look like.
> Any proposal is more than welcome.
> Regards,
> Tommaso

2012/7/7 Thomas Jungblut <thomas.jungblut@gmail.com>:
> Looks fine to me.
> The key is the interfaces for learning and predicting, so we should define some vectors and matrices.
> It would be enough to define the algorithms via the interfaces, and a generic BSP should just run them based on the given input.

2012/7/7 Tommaso Teofili <tommaso.teofili@gmail.com>:
> Hi all,
>
> in my spare time I started writing some basic BSP-based machine learning algorithms for our ml module. Now I'm wondering, from a design point of view, where it'd make sense to put the training data / model. I'd assume the obvious answer is HDFS, which makes me think we should come up with (at least) two BSP jobs for each algorithm: one for learning and one for "predicting", each to be run separately.
> This would allow reading the training data from HDFS and consequently creating a model (also on HDFS); the created model could then be read (again from HDFS) in order to predict an output for a new input.
> Does that make sense?
> I'm just wondering what a general-purpose design for Hama-based ML stuff would look like, so this is just to start the discussion; any opinion is welcome.
>
> Cheers,
> Tommaso
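For whoever volunteers for the (mini-)batch gradient descent mentioned above, a minimal sequential sketch of the update in plain Java may help frame the discussion. All names here are hypothetical, and no Hama API is used, since how the batches are partitioned across peers is exactly the open design question:

```java
import java.util.Arrays;

// Mini-batch gradient descent for simple linear regression (y = w*x + b).
// Sequential sketch only; in a BSP version each peer would compute partial
// gradients on its slice and the update would happen once per superstep.
public class MiniBatchGd {

  // One gradient step on the batch [from, to); returns the updated {w, b}.
  static double[] step(double[] x, double[] y, double w, double b,
                       int from, int to, double alpha) {
    double gw = 0, gb = 0;
    int n = to - from;
    for (int i = from; i < to; i++) {
      double err = (w * x[i] + b) - y[i];
      gw += err * x[i];
      gb += err;
    }
    return new double[] { w - alpha * gw / n, b - alpha * gb / n };
  }

  // Sweep the data in mini-batches for a number of epochs.
  static double[] fit(double[] x, double[] y, int batchSize, double alpha, int epochs) {
    double w = 0, b = 0;
    for (int e = 0; e < epochs; e++) {
      for (int from = 0; from < x.length; from += batchSize) {
        int to = Math.min(from + batchSize, x.length);
        double[] p = step(x, y, w, b, from, to, alpha);
        w = p[0];
        b = p[1];
      }
    }
    return new double[] { w, b };
  }

  public static void main(String[] args) {
    // Data generated from y = 2x + 1; the fit should recover w ~ 2, b ~ 1.
    double[] x = { 0, 1, 2, 3, 4, 5 };
    double[] y = { 1, 3, 5, 7, 9, 11 };
    System.out.println(Arrays.toString(fit(x, y, 2, 0.05, 2000)));
  }
}
```

The per-batch gradient sum is what would naturally be computed per peer and combined via messages before `sync()`.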
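The weighted majority meta-learner mentioned in the thread can be sketched in a few lines. This is a hypothetical, sequential version in which each "expert" stands in for one peer's learner and is penalized multiplicatively for every mistaken prediction:

```java
import java.util.Arrays;

// Weighted majority over a pool of experts (hypothetical sketch; in the BSP
// setting each peer would hold one learner, and votes/weights would be
// exchanged as messages between supersteps).
public class WeightedMajority {

  // Multiply an expert's weight by beta (0 < beta < 1) for every mistake.
  static double[] train(int[][] predictions, int[] labels, double beta) {
    double[] w = new double[predictions[0].length];
    Arrays.fill(w, 1.0);
    for (int t = 0; t < labels.length; t++)
      for (int e = 0; e < w.length; e++)
        if (predictions[t][e] != labels[t]) w[e] *= beta;
    return w;
  }

  // Combined prediction: weighted vote over the binary labels {0, 1}.
  static int predict(int[] votes, double[] w) {
    double ones = 0, zeros = 0;
    for (int e = 0; e < w.length; e++) {
      if (votes[e] == 1) ones += w[e]; else zeros += w[e];
    }
    return ones >= zeros ? 1 : 0;
  }

  public static void main(String[] args) {
    // Expert 0 always votes 1, expert 1 always votes 0; truth is always 1,
    // so expert 1 is halved three times: weights end up [1.0, 0.125].
    int[][] preds = { { 1, 0 }, { 1, 0 }, { 1, 0 } };
    int[] labels = { 1, 1, 1 };
    double[] w = train(preds, labels, 0.5);
    System.out.println(Arrays.toString(w) + " -> " + predict(new int[] { 1, 0 }, w));
  }
}
```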
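To illustrate the kind of builder-style chaining the proposed math interfaces enable (a hypothetical minimal vector, not the actual tjungblut-math DoubleVector API; see the links above for the real interface):

```java
// Minimal immutable dense vector with chainable operations, in the spirit of
// the DoubleVector interface linked in the thread (hypothetical sketch).
public class Vec {
  final double[] v;

  Vec(double... v) { this.v = v; }

  // Element-wise addition; returns a new vector so calls can be chained.
  Vec add(Vec o) {
    double[] r = new double[v.length];
    for (int i = 0; i < v.length; i++) r[i] = v[i] + o.v[i];
    return new Vec(r);
  }

  // Scalar multiplication, also chainable.
  Vec multiply(double s) {
    double[] r = new double[v.length];
    for (int i = 0; i < v.length; i++) r[i] = v[i] * s;
    return new Vec(r);
  }

  double dot(Vec o) {
    double d = 0;
    for (int i = 0; i < v.length; i++) d += v[i] * o.v[i];
    return d;
  }

  static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

  public static void main(String[] args) {
    // Chaining lets Octave-style equations translate almost verbatim,
    // e.g. the logistic hypothesis h = sigmoid(theta' * x):
    Vec theta = new Vec(0.5, -0.25);
    Vec x = new Vec(1.0, 2.0);
    double h = sigmoid(theta.multiply(2).dot(x));
    System.out.println(h); // theta*2 = (1.0, -0.5); dot x = 0; sigmoid(0) = 0.5
  }
}
```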
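On the train-then-predict design discussed at the start of the thread, the model a learning job writes to HDFS is essentially a serialized vector; the read/write pair a dense vector Writable needs can be sketched with plain java.io streams (in Hama/Hadoop this would implement the org.apache.hadoop.io.Writable interface, whose write/readFields methods take exactly these DataOutput/DataInput arguments):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Sketch of the serialization a dense VectorWritable needs: length prefix,
// then the raw doubles. Plain streams here; on a real cluster the sink/source
// would be an HDFS file shared by the learning and prediction jobs.
public class DenseVectorIo {

  static void write(DataOutput out, double[] v) throws IOException {
    out.writeInt(v.length);
    for (double d : v) out.writeDouble(d);
  }

  static double[] read(DataInput in) throws IOException {
    double[] v = new double[in.readInt()];
    for (int i = 0; i < v.length; i++) v[i] = in.readDouble();
    return v;
  }

  public static void main(String[] args) throws IOException {
    double[] model = { 2.0, 1.0, -0.5 };
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    write(new DataOutputStream(bos), model);
    double[] back = read(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
    System.out.println(Arrays.equals(model, back)); // round-trip check, prints true
  }
}
```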