Date: Mon, 28 Mar 2016 20:29:33 -0700
Subject: Re: Contributing GMM and Perceptron to MADLib
From: Frank McQuillan
To: Rahul Iyer
Cc: Roman Shaposhnik, dev@madlib.incubator.apache.org
Reply-To: dev@madlib.incubator.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a113a75fe4d30aa052f27a3c3

--001a113a75fe4d30aa052f27a3c3
Content-Type: text/plain; charset=UTF-8

Let me figure out how to do this and add Aditya as the owner of that JIRA.
My initial attempts in ASF infra-land were not quite successful.

Frank

On Mon, Mar 28, 2016 at 4:54 PM, Rahul Iyer wrote:

> @Frank, Roman: I believe Aditya needs to be added as a developer to the
> MADlib project before a JIRA can be assigned to him. Is this only
> available to the lead/owner?
>
> On Mon, Mar 28, 2016 at 3:49 PM, Aditya Nain wrote:
>
>> Hi Rahul,
>>
>> I didn't have an id, so I created one now.
>> My id is: Aditya Nain
>>
>> Thanks,
>> Aditya
>>
>> On Mon, Mar 28, 2016 at 6:40 PM, Rahul Iyer wrote:
>>
>> > I can assign this to you, but you need to have an account in
>> > https://issues.apache.org.
>> > If you already have an account, then please send your id - I wasn't
>> > able to find you just using your name.
>> >
>> > On Mon, Mar 28, 2016 at 3:31 PM, Aditya Nain wrote:
>> >
>> > > Hi Rahul,
>> > >
>> > > Thanks for the reply!
>> > >
>> > > I am working on implementing a Gaussian Mixture Model, assuming that
>> > > the covariance matrix is the same for all the Gaussians.
>> > > The JIRA that deals with GMM is MADLIB-410:
>> > >
>> > > https://issues.apache.org/jira/browse/MADLIB-410?jql=project%20%3D%20MADLIB
>> > >
>> > > Can this be assigned to me, or how do I get it assigned to me?
>> > >
>> > > Thanks,
>> > > Aditya
>> > >
>> > > On Mon, Mar 21, 2016 at 3:41 PM, Rahul Iyer wrote:
>> > >
>> > > > Hi Aditya,
>> > > >
>> > > > Welcome to the MADlib community!
>> > > >
>> > > > Gaussian mixture models are extremely useful and we would heartily
>> > > > welcome a contribution for them. The SQLEM paper might be
>> > > > oversimplifying the capabilities of the database (e.g. assuming
>> > > > there is no array type is unnecessary for PostgreSQL). You could
>> > > > speed things up (both dev time and execution time) by writing some
>> > > > of the functions in C++. K-means is an example of how clustering is
>> > > > implemented.
>> > > >
>> > > > IMO, assuming the same covariance matrix is reasonable. We could
>> > > > extend the capabilities after the initial implementation is complete.
>> > > >
>> > > > There was some work started a long time ago that built perceptrons
>> > > > using the convex framework (link
>> > > > <https://github.com/iyerr3/madlib/tree/mlp>).
>> > > > There are still some bugs in that code, since the trained network
>> > > > isn't converging. You could start there or build a new module;
>> > > > either way, an MLP module is frequently requested by the data
>> > > > science community.
>> > > >
>> > > > I would suggest starting with Gaussian mixtures and then moving to
>> > > > perceptrons once the GMM work is completed.
>> > > >
>> > > > Feel free to ask questions on this forum. Looking forward to
>> > > > collaborating with you.
>> > > >
>> > > > Best,
>> > > > Rahul
>> > > >
>> > > > On Thu, Mar 17, 2016 at 2:08 PM, Aditya Nain wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > My name is Aditya Nain, and I am a graduate student at the
>> > > > > University of Florida.
>> > > > > I have been learning MADlib for a while and want to contribute to
>> > > > > it. I went through some of the open stories in JIRA and started
>> > > > > working on MADLIB-410:
>> > > > >
>> > > > > https://issues.apache.org/jira/browse/MADLIB-410?jql=project%20%3D%20MADLIB
>> > > > >
>> > > > > which is about implementing a Gaussian Mixture Model using the
>> > > > > Expectation Maximization (EM) algorithm.
>> > > > >
>> > > > > I came across the following paper while searching for a
>> > > > > distributed EM algorithm that can be implemented in MADlib.
>> > > > >
>> > > > > Carlos Ordonez, Paul Cereghini, "SQLEM: fast clustering in SQL
>> > > > > using the EM algorithm", ACM SIGMOD Record, Volume 29, Issue 2,
>> > > > > June 2000, Pages 559-570.
>> > > > > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.7564
>> > > > >
>> > > > > I thought of implementing the approach discussed in the paper,
>> > > > > but the paper assumes that the covariance matrix is the same for
>> > > > > all the clusters (i.e. the covariance matrix is the same for all
>> > > > > the Gaussian distributions). So, I wanted to know the community's
>> > > > > opinion on whether it's fine to go with the assumption made in
>> > > > > the paper and implement it in MADlib.
>> > > > >
>> > > > > Also, MADlib currently doesn't have an implementation of a
>> > > > > perceptron, nor did I find any open story related to it in JIRA.
>> > > > > I came across the following paper, which talks about a
>> > > > > distributed algorithm for the perceptron:
>> > > > >
>> > > > > Ryan McDonald, Keith Hall, Gideon Mann, "Distributed training
>> > > > > strategies for the structured perceptron"
>> > > > > http://dl.acm.org/citation.cfm?id=1858068
>> > > > >
>> > > > > Would it be useful to have a distributed implementation of a
>> > > > > perceptron in MADlib?
>> > > > >
>> > > > > Thanks,
>> > > > > Aditya

--001a113a75fe4d30aa052f27a3c3--