From: Tommaso Teofili
Date: Wed, 11 Jul 2012 13:28:55 +0200
Subject: Re: [ML] - data storage and basic design approach
To: dev@hama.apache.org

Maybe for Miklai it'd be good to just keep his math/matrix classes; once he's finished we could eventually merge them together into a dedicated math module.

My 2 cents,
Tommaso

2012/7/10 Thomas Jungblut:
> I have told him that he could use it; he uses a different approach.
> You said that we can merge later when he is ready.
> First come, first served.

2012/7/10 Edward J. Yoon:
> My concern is that this looks like duplicated effort with Miklai.
> I think it needs to be organized.
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon

On Tue, Jul 10, 2012 at 8:26 PM, Thomas Jungblut wrote:
> Splitting out a math module would be smarter, but let's just keep that in the ML package.
>
> Anyone volunteer to code a simple (mini-)batch gradient descent in BSP?
> http://holehouse.org/mlclass/17_Large_Scale_Machine_Learning.html

2012/7/10 Edward J. Yoon:
> I would like to move it to the core module so that others can reuse it.

On Tue, Jul 10, 2012 at 7:13 PM, Tommaso Teofili wrote:
> I've done the first import, we can start from that now. Thanks Thomas.
> Tommaso

2012/7/10 Tommaso Teofili:
> ok, I'll try that, thanks :)
> Tommaso

2012/7/10 Thomas Jungblut:
> I don't know if we need sparse/named vectors for the first scratch.
> You can just use the interface and the dense implementations, and remove all the uncompilable code in the writables.

2012/7/10 Tommaso Teofili:
> Thomas, while inspecting the code I realized I may need to import most/all of the classes inside your math library for the writables to compile. Is that ok for you, or would you rather not?
> Regards,
> Tommaso

2012/7/10 Thomas Jungblut:
> great, thank you for taking care of it ;)

2012/7/10 Tommaso Teofili:
> Ok, sure, I'll just add the writables along with DoubleMatrix/Vector with the AL2 headers on top.
> Thanks Thomas for the contribution and feedback.
> Tommaso

2012/7/10 Thomas Jungblut:
> Feel free to commit this, but take care to add the Apache license headers.
> Also, I wanted to add a few testcases over the next few weekends.

2012/7/10 Tommaso Teofili:
> nice idea; thinking about it quickly, it looks to me that (C)GD is a good fit for BSP.
> Also, I was trying to implement some easy meta-learning algorithm like the weighted majority algorithm, where each peer has its own learning algorithm and gets penalized for each mistaken prediction.
> Regarding your math library, do you plan to commit it yourself? Otherwise I can do it.
> Regards,
> Tommaso

2012/7/10 Thomas Jungblut:
> Maybe a first good step towards algorithms would be to try to evaluate how we can implement some non-linear optimizers in BSP (BFGS or the conjugate gradient method).

2012/7/9 Tommaso Teofili:
> very nice, +1!
> You're right, it's more reasonable to just proceed bottom-up with this, as we're going to have a clearer idea while developing the different algorithms.
> So for now I'd introduce your library Writables and then proceed one step at a time with the more common API.
> Thanks,
> Tommaso

2012/7/9 Thomas Jungblut:
> For the matrix/vector I would propose my library interface (quite like Mahout's math, but without boundary checks):
>
> https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java
> https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java
>
> Full Writable for Vector and basic Writable for Matrix:
>
> https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable
>
> It is enough to implement all the machine learning algorithms I've seen until now, and the builder pattern allows really nice chaining of commands to easily code equations or translate code from Matlab/Octave. See for example the logistic regression cost function:
>
> https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java
>
> For the interfaces of the algorithms: I guess we need to get some more experience. I cannot tell how the interfaces for them should look, mainly because I don't know how the BSP version of them will call the algorithm logic.
> But having stable math interfaces is the key point.

2012/7/9 Tommaso Teofili <tommaso.teofili@gmail.com>:
> Ok, so let's sketch up here what these interfaces should look like.
> Any proposal is more than welcome.
> Regards,
> Tommaso

2012/7/7 Thomas Jungblut <thomas.jungblut@gmail.com>:
> Looks fine to me.
> The key is the interfaces for learning and predicting, so we should define some vectors and matrices.
> It would be enough to define the algorithms via the interfaces, and a generic BSP should just run them based on the given input.

2012/7/7 Tommaso Teofili <tommaso.teofili@gmail.com>:
> Hi all,
>
> in my spare time I started writing some basic BSP-based machine learning algorithms for our ml module. Now I'm wondering, from a design point of view, where it'd make sense to put the training data / model. I'd assume the obvious answer is HDFS, which makes me think we should come up with (at least) two BSP jobs for each algorithm: one for learning and one for "predicting", each to be run separately.
> This would allow reading the training data from HDFS and consequently creating a model (also on HDFS); the created model could then be read (again from HDFS) in order to predict an output for a new input.
> Does that make sense?
> I'm just wondering what a general-purpose design for Hama-based ML stuff would look like, so this is just to start the discussion; any opinion is welcome.
>
> Cheers,
> Tommaso
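For whoever volunteers for the (mini-)batch gradient descent mentioned above, a minimal sequential sketch of the update in plain Java may help frame the discussion. All names here are hypothetical, and no Hama API is used, since how the batches are partitioned across peers is exactly the open design question:

```java
import java.util.Arrays;

// Mini-batch gradient descent for simple linear regression (y = w*x + b).
// Sequential sketch only; in a BSP version each peer would compute partial
// gradients on its slice and the update would happen once per superstep.
public class MiniBatchGd {

  // One gradient step on the batch [from, to); returns the updated {w, b}.
  static double[] step(double[] x, double[] y, double w, double b,
                       int from, int to, double alpha) {
    double gw = 0, gb = 0;
    int n = to - from;
    for (int i = from; i < to; i++) {
      double err = (w * x[i] + b) - y[i];
      gw += err * x[i];
      gb += err;
    }
    return new double[] { w - alpha * gw / n, b - alpha * gb / n };
  }

  // Sweep the data in mini-batches for a number of epochs.
  static double[] fit(double[] x, double[] y, int batchSize, double alpha, int epochs) {
    double w = 0, b = 0;
    for (int e = 0; e < epochs; e++) {
      for (int from = 0; from < x.length; from += batchSize) {
        int to = Math.min(from + batchSize, x.length);
        double[] p = step(x, y, w, b, from, to, alpha);
        w = p[0];
        b = p[1];
      }
    }
    return new double[] { w, b };
  }

  public static void main(String[] args) {
    // Data generated from y = 2x + 1; the fit should recover w ~ 2, b ~ 1.
    double[] x = { 0, 1, 2, 3, 4, 5 };
    double[] y = { 1, 3, 5, 7, 9, 11 };
    System.out.println(Arrays.toString(fit(x, y, 2, 0.05, 2000)));
  }
}
```

The per-batch gradient sum is what would naturally be computed per peer and combined via messages before `sync()`.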
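The weighted majority meta-learner mentioned in the thread can be sketched in a few lines. This is a hypothetical, sequential version in which each "expert" stands in for one peer's learner and is penalized multiplicatively for every mistaken prediction:

```java
import java.util.Arrays;

// Weighted majority over a pool of experts (hypothetical sketch; in the BSP
// setting each peer would hold one learner, and votes/weights would be
// exchanged as messages between supersteps).
public class WeightedMajority {

  // Multiply an expert's weight by beta (0 < beta < 1) for every mistake.
  static double[] train(int[][] predictions, int[] labels, double beta) {
    double[] w = new double[predictions[0].length];
    Arrays.fill(w, 1.0);
    for (int t = 0; t < labels.length; t++)
      for (int e = 0; e < w.length; e++)
        if (predictions[t][e] != labels[t]) w[e] *= beta;
    return w;
  }

  // Combined prediction: weighted vote over the binary labels {0, 1}.
  static int predict(int[] votes, double[] w) {
    double ones = 0, zeros = 0;
    for (int e = 0; e < w.length; e++) {
      if (votes[e] == 1) ones += w[e]; else zeros += w[e];
    }
    return ones >= zeros ? 1 : 0;
  }

  public static void main(String[] args) {
    // Expert 0 always votes 1, expert 1 always votes 0; truth is always 1,
    // so expert 1 is halved three times: weights end up [1.0, 0.125].
    int[][] preds = { { 1, 0 }, { 1, 0 }, { 1, 0 } };
    int[] labels = { 1, 1, 1 };
    double[] w = train(preds, labels, 0.5);
    System.out.println(Arrays.toString(w) + " -> " + predict(new int[] { 1, 0 }, w));
  }
}
```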
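To illustrate the kind of builder-style chaining the proposed math interfaces enable (a hypothetical minimal vector, not the actual tjungblut-math DoubleVector API; see the links above for the real interface):

```java
// Minimal immutable dense vector with chainable operations, in the spirit of
// the DoubleVector interface linked in the thread (hypothetical sketch).
public class Vec {
  final double[] v;

  Vec(double... v) { this.v = v; }

  // Element-wise addition; returns a new vector so calls can be chained.
  Vec add(Vec o) {
    double[] r = new double[v.length];
    for (int i = 0; i < v.length; i++) r[i] = v[i] + o.v[i];
    return new Vec(r);
  }

  // Scalar multiplication, also chainable.
  Vec multiply(double s) {
    double[] r = new double[v.length];
    for (int i = 0; i < v.length; i++) r[i] = v[i] * s;
    return new Vec(r);
  }

  double dot(Vec o) {
    double d = 0;
    for (int i = 0; i < v.length; i++) d += v[i] * o.v[i];
    return d;
  }

  static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

  public static void main(String[] args) {
    // Chaining lets Octave-style equations translate almost verbatim,
    // e.g. the logistic hypothesis h = sigmoid(theta' * x):
    Vec theta = new Vec(0.5, -0.25);
    Vec x = new Vec(1.0, 2.0);
    double h = sigmoid(theta.multiply(2).dot(x));
    System.out.println(h); // theta*2 = (1.0, -0.5); dot x = 0; sigmoid(0) = 0.5
  }
}
```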
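On the train-then-predict design discussed at the start of the thread, the model a learning job writes to HDFS is essentially a serialized vector; the read/write pair a dense vector Writable needs can be sketched with plain java.io streams (in Hama/Hadoop this would implement the org.apache.hadoop.io.Writable interface, whose write/readFields methods take exactly these DataOutput/DataInput arguments):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Sketch of the serialization a dense VectorWritable needs: length prefix,
// then the raw doubles. Plain streams here; on a real cluster the sink/source
// would be an HDFS file shared by the learning and prediction jobs.
public class DenseVectorIo {

  static void write(DataOutput out, double[] v) throws IOException {
    out.writeInt(v.length);
    for (double d : v) out.writeDouble(d);
  }

  static double[] read(DataInput in) throws IOException {
    double[] v = new double[in.readInt()];
    for (int i = 0; i < v.length; i++) v[i] = in.readDouble();
    return v;
  }

  public static void main(String[] args) throws IOException {
    double[] model = { 2.0, 1.0, -0.5 };
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    write(new DataOutputStream(bos), model);
    double[] back = read(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
    System.out.println(Arrays.equals(model, back)); // round-trip check, prints true
  }
}
```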