Subject: Re: [math] redesigning the optimization/estimation packages (phase 2)
From: Ted Dunning
To: Commons Developers List
Date: Thu, 26 Feb 2009 15:21:55 -0800

Here is what I would do in R to fit a linear model with observations
consisting of input variables x1 ... x3 and target variable y:

  m <- lm(y ~ x1 + x2 + x3, data)

The first argument lets me supply the form of the regression. I could have
regressed on log(x3), for instance, or used x1*x2 + x3 to include x1, x2, x3
and the x1:x2 interaction, or (x1+x2+x3)^2.
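To make the formula flexibility concrete, here are those variants written out
as complete calls. This is just a sketch using the same illustrative data
frame and column names as above (there is nothing special about these
particular models):

  # plain additive model in x1, x2, x3
  m1 <- lm(y ~ x1 + x2 + x3, data)

  # regress on a transformed input instead of the raw column
  m2 <- lm(y ~ x1 + x2 + log(x3), data)

  # x1, x2 and x3 plus the x1:x2 interaction
  m3 <- lm(y ~ x1*x2 + x3, data)

  # all main effects plus all two-way interactions
  m4 <- lm(y ~ (x1 + x2 + x3)^2, data)

All of that variation happens in the formula; the data frame never changes.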
To do logistic regression, I would switch to glm (for generalized linear
model):

  m <- glm(y ~ x1 + x2 + x3, data, family=binomial())

To do Bayesian logistic regression, I would use bayesglm from the arm
library:

  # this happens once to install the package
  install.packages("arm")

  # this happens once per session
  library(arm)

  # and this gives me my model
  m <- bayesglm(y ~ x1 + x2 + x3, data, family=binomial())

To give weights to observations using the k column of the data:

  m <- bayesglm(y ~ x1 + x2 + x3, data, family=binomial(), weights=data$k)

If I want to compute the posterior distribution of outputs, I can use sim(m)
to get a bunch of samples from the coefficient posterior and multiply them
by the observation matrix in one or two (fairly simple) lines of code. If I
just want outputs for MAP model estimates, then I do

  ytwiddle <- predict(m, newdata)

This works for any of the models above. If I could do anything like this
using commons math in less than 10x as many lines of code, I would be
thrilled.

Some notes about the user experience here:

a) I don't have to care about residuals, but if I do care, they are in the
resulting model structure and I can look at them. If I do plot(m), I get a
bunch of interesting diagnostic plots. If I do coefplot(m), I get my
coefficients with error bars.

b) I don't care *how* my weights affect the algorithm. It just works. I can
dig into the guts in various ways, but I don't have to.

c) I can effectively subset my data column-wise without creating new
matrices. I can also change the structure of my model all over the place
very easily without changing the data. This is (or was) a controversial
aspect of R's modeling structure, and it makes it really hard to find
logistic regression in the index of a book, because the libraries are
oriented around the generalized linear modeling view of the world (which
isn't so bad once you get going with it).

d) Lots of diagnostic info gets embedded into the resulting model.

On Thu, Feb 26, 2009 at 12:14 PM, Luc Maisonobe wrote:

> This way, the residual would not be seen and only the
> pure observation vectors should be provided by user.

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)
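P.S. Since I was vague about the "one or two lines" for posterior outputs,
here is roughly what I mean. Treat it as a sketch only: the exact accessor
for the simulated coefficients varies with the arm version, and newdata is
the same hypothetical data frame of new observations used with predict
above.

  library(arm)

  # fit the Bayesian logistic regression as before
  m <- bayesglm(y ~ x1 + x2 + x3, data, family=binomial())

  # draw coefficient vectors from the approximate posterior
  s <- sim(m, n.sims=1000)
  beta <- coef(s)   # matrix of draws: n.sims rows, one column per coefficient

  # design matrix for the new observations (intercept column included)
  X <- model.matrix(~ x1 + x2 + x3, newdata)

  # push every sampled coefficient vector through the inverse logit;
  # rows are new observations, columns are posterior draws
  post <- plogis(X %*% t(beta))

From post you can read off whatever summaries you want (means, quantiles,
and so on) per observation.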