From: Rahul Iyer
Date: Mon, 21 Mar 2016 12:41:06 -0700
Subject: Re: Contributing GMM and Perceptron to MADLib
To: dev@madlib.incubator.apache.org

Hi Aditya,

Welcome to the MADlib community!

Gaussian mixture models are extremely useful and we would heartily welcome
a contribution. The SQLEM paper might be oversimplifying the capabilities
of the database (e.g. the assumption that there is no array type is
unnecessary for PostgreSQL). You could speed things up (both development
time and execution time) by writing some of the functions in C++. The
k-means module is an example of how clustering is implemented. IMO,
assuming the same covariance matrix is reasonable; we could extend the
capabilities after the initial implementation is complete.

There was some work started a long time ago that built perceptrons using
the convex framework (link). There are still some bugs in that code, since
the trained network isn't converging. You could start there or build a new
module - either way, an MLP module is frequently requested by the data
science community.

I would suggest starting with Gaussian mixtures and then moving to
perceptrons once the GMM work is complete. Feel free to ask questions on
this forum. Looking forward to collaborating with you.

(Rough sketches of the tied-covariance EM updates and of the iterative
parameter mixing strategy from the perceptron paper cited in the quoted
message are appended below.)

Best,
Rahul

On Thu, Mar 17, 2016 at 2:08 PM, Aditya Nain wrote:

> Hi,
>
> My name is Aditya Nain, and I am a graduate student at the University of
> Florida. I have been learning MADLib for a while and want to contribute
> to MADLib. I went through some of the open stories in JIRA and started
> working on MADLIB-410:
>
> https://issues.apache.org/jira/browse/MADLIB-410?jql=project%20%3D%20MADLIB
>
> which is about implementing a Gaussian Mixture Model using the
> Expectation Maximization (EM) algorithm.
>
> I came across the following paper while searching for a distributed EM
> algorithm that can be implemented in MADLib:
>
> Carlos Ordonez, Paul Cereghini, "SQLEM: fast clustering in SQL using the
> EM algorithm", ACM SIGMOD Record, Volume 29, Issue 2, June 2000, Pages
> 559-570. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.7564
>
> I thought of implementing the approach discussed in the paper, but the
> paper makes the assumption that the covariance matrix is the same for
> all the clusters (i.e., the covariance matrix is the same for all the
> Gaussian distributions). So, I wanted to know the opinion of the
> community on whether it is fine to go with the assumption made in the
> paper and implement it in MADLib.
>
> Also, MADLib currently doesn't have an implementation of a perceptron,
> nor did I find any open story related to it in JIRA. I came across the
> following paper, which describes a distributed algorithm for the
> perceptron:
>
> Ryan McDonald, Keith Hall, Gideon Mann, "Distributed training strategies
> for the structured perceptron".
> http://dl.acm.org/citation.cfm?id=1858068
>
> Would it be useful to have a distributed implementation of the perceptron
> in MADlib?
>
> Thanks,
> Aditya
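
For reference, here is a minimal NumPy sketch of the EM updates when all
Gaussian components share a single ("tied") covariance matrix, which is the
simplification the SQLEM paper assumes. It only illustrates the math; it is
not MADlib code, and the function and variable names are invented for this
sketch.

import numpy as np
from scipy.stats import multivariate_normal

def em_tied_covariance(X, k, n_iter=100, seed=0):
    """EM for a k-component Gaussian mixture with one shared covariance."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=k, replace=False)]   # component means
    sigma = np.cov(X, rowvar=False)                # single shared covariance
    w = np.full(k, 1.0 / k)                        # mixing weights

    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        r = np.column_stack([
            w[j] * multivariate_normal.pdf(X, mu[j], sigma)
            for j in range(k)
        ])
        r /= r.sum(axis=1, keepdims=True)

        # M-step: weights, means, and the single pooled covariance
        nk = r.sum(axis=0)                         # effective counts
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        sigma = np.zeros((d, d))
        for j in range(k):
            diff = X - mu[j]
            sigma += (r[:, j, None] * diff).T @ diff
        sigma /= n                                 # pooled over all components
    return w, mu, sigma

The only difference from a general GMM is the covariance M-step: the weighted
scatter matrices of all components are pooled and divided by n, so a single
d-by-d matrix is maintained instead of one per component.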
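
Similarly, here is a minimal sketch of the "iterative parameter mixing"
strategy from the McDonald, Hall and Mann paper cited above, applied to a
plain binary perceptron rather than the structured perceptron used in the
paper. Again, this is purely illustrative: no MADlib API is implied, and all
names are invented.

import numpy as np

def perceptron_epoch(w, X, y):
    """One sequential perceptron pass over one data shard (y in {-1, +1})."""
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:        # misclassified -> additive update
            w = w + yi * xi
    return w

def iterative_parameter_mixing(shards, d, n_epochs=10):
    """shards: list of (X, y) pairs, one per worker/segment."""
    w = np.zeros(d)
    for _ in range(n_epochs):
        # Each shard trains locally, starting from the current global weights.
        # In a distributed database this step would run in parallel on the
        # segments; the sketch just loops over the shards.
        local = [perceptron_epoch(w.copy(), X, y) for X, y in shards]
        # Mix: average the per-shard weight vectors into the new global model.
        w = np.mean(local, axis=0)
    return w

The paper also considers one-shot parameter mixing (train each shard
independently and average once at the end); the sketch above uses the
iterative variant, which mixes after every epoch.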