Subject: Re: [math] redesigning the optimization/estimation packages (phase 2)
From: Ted Dunning
To: Commons Developers List
Date: Thu, 26 Feb 2009 15:21:55 -0800

Here is what I would do in R to fit a linear model with observations
consisting of input variables x1 ... x3 and target variable y:

  m <- lm(y ~ x1 + x2 + x3, data)

The first argument lets me supply the form of the regression. I could have
regressed on log(x3), for instance, or used x1*x2 + x3 to include x1, x2, x3
and the x1:x2 interaction, or (x1+x2+x3)^2.
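To make the formula flexibility concrete, here are those variants written out
as complete calls. This is just a sketch using the same illustrative data
frame and column names as above (there is nothing special about these
particular models):

  # plain additive model in x1, x2, x3
  m1 <- lm(y ~ x1 + x2 + x3, data)

  # regress on a transformed input instead of the raw column
  m2 <- lm(y ~ x1 + x2 + log(x3), data)

  # x1, x2 and x3 plus the x1:x2 interaction
  m3 <- lm(y ~ x1*x2 + x3, data)

  # all main effects plus all two-way interactions
  m4 <- lm(y ~ (x1 + x2 + x3)^2, data)

All of that variation happens in the formula; the data frame never changes.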
To do logistic regression, I would switch to glm (for generalized linear
model):

  m <- glm(y ~ x1 + x2 + x3, data, family=binomial())

To do Bayesian logistic regression, I would use bayesglm from the arm
library:

  # this happens once to install the package
  install.packages("arm")

  # this happens once per session
  library(arm)

  # and this gives me my model
  m <- bayesglm(y ~ x1 + x2 + x3, data, family=binomial())

To give weights to observations using the k column of the data:

  m <- bayesglm(y ~ x1 + x2 + x3, data, family=binomial(), weights=data$k)

If I want to compute the posterior distribution of outputs, I can use sim(m)
to get a bunch of samples from the coefficient posterior and multiply them
by the observation matrix in one or two (fairly simple) lines of code. If I
just want outputs for MAP model estimates, then I do

  ytwiddle <- predict(m, newdata)

This works for any of the models above. If I could do anything like this
using commons math in less than 10x as many lines of code, I would be
thrilled.

Some notes about the user experience here:

a) I don't have to care about residuals, but if I do care, they are in the
resulting model structure and I can look at them. If I do plot(m), I get a
bunch of interesting diagnostic plots. If I do coefplot(m), I get my
coefficients with error bars.

b) I don't care *how* my weights affect the algorithm. It just works. I can
dig into the guts in various ways, but I don't have to.

c) I can effectively subset my data column-wise without creating new
matrices. I can also change the structure of my model all over the place
very easily without changing the data. This is (or was) a controversial
aspect of R's modeling structure, and it makes it really hard to find
logistic regression in the index of a book, because the libraries are
oriented around the generalized linear modeling view of the world (which
isn't so bad once you get going with it).

d) Lots of diagnostic info gets embedded into the resulting model.

On Thu, Feb 26, 2009 at 12:14 PM, Luc Maisonobe wrote:

> This way, the residual would not be seen and only the
> pure observation vectors should be provided by user.

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)
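P.S. Since I was vague about the "one or two lines" for posterior outputs,
here is roughly what I mean. Treat it as a sketch only: the exact accessor
for the simulated coefficients varies with the arm version, and newdata is
the same hypothetical data frame of new observations used with predict
above.

  library(arm)

  # fit the Bayesian logistic regression as before
  m <- bayesglm(y ~ x1 + x2 + x3, data, family=binomial())

  # draw coefficient vectors from the approximate posterior
  s <- sim(m, n.sims=1000)
  beta <- coef(s)   # matrix of draws: n.sims rows, one column per coefficient

  # design matrix for the new observations (intercept column included)
  X <- model.matrix(~ x1 + x2 + x3, newdata)

  # push every sampled coefficient vector through the inverse logit;
  # rows are new observations, columns are posterior draws
  post <- plogis(X %*% t(beta))

From post you can read off whatever summaries you want (means, quantiles,
and so on) per observation.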