mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Confidence interval for logistic regression
Date Thu, 31 Mar 2011 21:16:42 GMT
Ted, thank you very much.

Let me check the references.

On Thu, Mar 31, 2011 at 1:53 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>
> On Thu, Mar 31, 2011 at 11:21 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>>
>> Thank you, Ted.
>>
>> (I think MAP is another way of saying the mode of the distribution over the
>> entire training set?).
>
> Not quite.  That would be maximum likelihood.
> MAP is the distribution mode for the likelihood times the prior (aka the
> posterior distribution).  For the uniform prior, these are the same.  Of
> course, the uniform prior is often nonsensical mathematically.
>
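Just to pin the distinction down in symbols:

    \hat{\beta}_{ML}  = \arg\max_{\beta} p(D \mid \beta)
    \hat{\beta}_{MAP} = \arg\max_{\beta} p(D \mid \beta)\, p(\beta)

so with a flat prior p(\beta) \propto 1 the two estimates coincide, which is the
point about the uniform prior above.
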
>>
>> I think i am talking about uncertainty of the result, not the
>> parameters...
>
> Sure.  But the posterior of the parameters leads to the posterior of the
> result.
> The real problem here is that you often have strong interactions in the
> parameters that will lead to the same result.
> For instance, if you have one predictor variable repeated in your input, you
> have the worst case of collinearity.  The L_1 regularized SGD will be unable
> to pick either variable, but the sum of the weights on the two variables
> will be constant and your predicted value could be perfectly accurate even
> though the parameters each separately appear to be uncertain.  The actual
> posterior for the parameter space is a highly correlated distribution.  The
> problem here is that the correlation matrix has n^2 elements (though it is
> sparse) which makes the computation of correlation difficult if only because
> over-fitting is even worse for n^2 elements than for n.
> Without some decent correlation estimate, you can't get a good error
> estimate on the result.
> So that is why the problem is hard.
> Here are a few possible solutions:
> a) on-line bootstrapping
> One thing that might work is to clone the CrossFoldLearner to make a
> bootstrap learner.  Each training example would be passed to a subset of the
> different sub-learners at random.  The goal would be to approximate
> resampling.  Then you would look at the diversity of opinion between the
> different classifiers.
> This has some theoretical difficulties with the relationship between what a
> real resampling would do and what this does.  Also, you are giving the
> learners less data so that isn't going to estimate the error you would get
> with all the data.
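To make the bootstrap idea concrete, here is a rough sketch (only an illustration;
OnlineLogisticRegression and L1 are the existing Mahout SGD classes, but this
BootstrapLearner wrapper, and choices like sending each example to each replica
with probability 1/2, are made up for the sketch):

    import java.util.Random;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    // Hypothetical online bootstrap: each example goes to a random subset of
    // sub-learners, and the spread of their predictions approximates the
    // uncertainty of the ensemble's estimate.
    public class BootstrapLearner {
      private final OnlineLogisticRegression[] learners;
      private final Random rand = new Random();

      public BootstrapLearner(int replicas, int numFeatures) {
        learners = new OnlineLogisticRegression[replicas];
        for (int i = 0; i < replicas; i++) {
          learners[i] = new OnlineLogisticRegression(2, numFeatures, new L1());
        }
      }

      public void train(int actual, Vector instance) {
        // pass the example to each sub-learner with probability 1/2,
        // crudely approximating resampling of the training set
        for (OnlineLogisticRegression learner : learners) {
          if (rand.nextBoolean()) {
            learner.train(actual, instance);
          }
        }
      }

      // mean prediction plus a dispersion estimate across the replicas
      public double[] classify(Vector instance) {
        double sum = 0, sumSq = 0;
        for (OnlineLogisticRegression learner : learners) {
          double p = learner.classifyScalar(instance);
          sum += p;
          sumSq += p * p;
        }
        double mean = sum / learners.length;
        double var = sumSq / learners.length - mean * mean;
        return new double[] {mean, Math.sqrt(Math.max(var, 0))};
      }
    }
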
>>
>> We want to say how confident the regressed value is. Or, roughly, what
>> the variance in the vicinity of X was, on average, in the
>> training set.
>>
>> Say we have a traditional PPC (pay per click) advertising. So we might
>> use a binomial regression to compute CTR prediction (prob of a
>> click-thru).
>
> Sure.
> And, in general, you don't want variance, but instead want to be able to
> sample the posterior.  Then you can sample the posterior estimate of regret
> for all of your models and ads and decide which one to pick. This is
> delicate because you need a realistic prior to avoid blue-sky bets all the
> time.  There is a major role in this for multi-level models so that you get
> well-founded priors.
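As a toy version of "sample the posterior, then pick": with a single conjugate
Beta prior per ad (a much cruder model than the regression, and commons-math3 is
assumed on the classpath for the Beta sampler), Thompson-style selection could
look like this sketch:

    import org.apache.commons.math3.distribution.BetaDistribution;

    // Toy posterior sampling for ad selection: each ad keeps a Beta posterior
    // over its CTR; we draw one sample per ad and show the ad whose sampled
    // expected payoff is largest.
    public class AdPicker {
      public static int pickAd(long[] clicks, long[] impressions, double[] bid,
                               double priorAlpha, double priorBeta) {
        int best = -1;
        double bestPayoff = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < clicks.length; i++) {
          BetaDistribution posterior = new BetaDistribution(
              priorAlpha + clicks[i],
              priorBeta + (impressions[i] - clicks[i]));
          double sampledCtr = posterior.sample();   // one draw from the posterior
          double payoff = sampledCtr * bid[i];      // expected revenue for this draw
          if (payoff > bestPayoff) {
            bestPayoff = payoff;
            best = i;
          }
        }
        return best;
      }
    }

The strength of priorAlpha/priorBeta is where the "realistic prior" comes in: it
keeps ads with almost no data from looking like sure wins.
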
>>
>> Then we could just multiply that by the expectation of what a click is worth
>> (obtained through some bidding system) and hence obtain the expected
>> payoff of a particular ad in a given situation.
>
> Payoff is only part of the problem, of course, because you really have a
> bandit problem here.  You need to model payoff and opportunity cost of
> making the wrong decision now, but also to incorporate the estimated benefit
> that learning about a model might have.  Again, strong priors are very
> valuable in this.
>>
>> But there's a difference between saying 'rev(A)=5c +- 0.1c'
>> and 'rev(A)=5c +- 2c', because in the first case we are pretty much damn sure
>> B is almost always better than A, and in the second case we just say
>> 'oh, they are both about the same, so just rotate them'.
>
> This is regret, i.e. the expectation of the opportunity cost:
>    C_j = \int_Y \left[ \max(Y) - y_j \right] \, dP(Y)
> where Y is the vector of payoffs and P(Y) is its multi-dimensional
> cumulative distribution.
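A Monte Carlo version of that integral, assuming we already have some way to draw
payoff vectors Y from the posterior (each row of the samples array below is one
such draw; the sampler itself is the hard part and is not shown):

    // Monte Carlo estimate of regret C_j = E[ max(Y) - y_j ] from posterior
    // draws of the payoff vector Y (samples[k][j] = payoff of ad j in draw k).
    public class Regret {
      public static double[] estimate(double[][] samples) {
        int numAds = samples[0].length;
        double[] regret = new double[numAds];
        for (double[] y : samples) {
          double max = Double.NEGATIVE_INFINITY;
          for (double v : y) {
            max = Math.max(max, v);
          }
          for (int j = 0; j < numAds; j++) {
            regret[j] += max - y[j];
          }
        }
        for (int j = 0; j < numAds; j++) {
          regret[j] /= samples.length;
        }
        return regret;
      }
    }
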
>>
>> So one way I see to go about this: if we have a regression for the mode
>> of the posterior
>> http://latex.codecogs.com/gif.latex?y=\hat{y}\left(\boldsymbol{z},\boldsymbol{\beta}\right)
>> then say we want to estimate the 'variance in the vicinity' by building a
>> regression for another target set composed of the squared errors
>>
>> http://latex.codecogs.com/gif.latex?\left(y-\hat{y}\right)^{2}=\hat{s}\left(\boldsymbol{x},\boldsymbol{\beta}\right)
>>
>> and that would give us much more leverage when comparing ad
>> performance. In other words, we want a better handle on questions like
>> 'how often does ad A perform better than ad B?'
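A sketch of that second regression (only an illustration; the variance model here
is a hand-rolled SGD linear fit on the squared residuals, paired with whatever
main model produces y-hat, e.g. Mahout's OnlineLogisticRegression):

    import org.apache.mahout.math.Vector;

    // Second regression on squared residuals: after the main model produces
    // yHat for an example, train this model on the target (y - yHat)^2 so it
    // learns a local "variance" estimate s(x).
    public class ResidualVarianceModel {
      private final double[] beta;
      private final double learningRate;

      public ResidualVarianceModel(int numFeatures, double learningRate) {
        this.beta = new double[numFeatures];
        this.learningRate = learningRate;
      }

      private double linear(Vector x) {
        double s = 0;
        for (int i = 0; i < beta.length; i++) {
          s += beta[i] * x.get(i);
        }
        return s;
      }

      public double predict(Vector x) {
        return Math.max(linear(x), 0);   // a variance estimate should not go negative
      }

      public void train(Vector x, double y, double yHat) {
        double target = (y - yHat) * (y - yHat);   // squared residual of the main model
        double error = linear(x) - target;         // squared-loss gradient on the raw score
        for (int i = 0; i < beta.length; i++) {
          beta[i] -= learningRate * error * x.get(i);
        }
      }
    }
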
>
> This is Laplace's method for estimating posterior distributions.  See
> http://www.inference.phy.cam.ac.uk/mackay/laplace.pdf
> for instance and the 1992 paper of his own that he cites on the first page.
>  MacKay's book is excellent on this and related topics.  See
> http://www.inference.phy.cam.ac.uk/itprnn/book.html
>
> These methods can be fruitful, but I don't know how to implement them in the
> presence of big data (i.e. in an on-line learner).  With small data, the
> bayesglm package in R may be helpful.   See
> http://www.stat.columbia.edu/~gelman/research/unpublished/priors7.pdf
> for more information.  I have used bayesglm in smaller data situations with
> very good results.
>
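
For reference, the Laplace approximation those pointers describe is just a
Gaussian fitted at the MAP estimate:

    p(\beta \mid D) \approx \mathcal{N}\!\left(\beta_{MAP},\; H^{-1}\right),
    \qquad H = -\left.\nabla^2 \log p(\beta \mid D)\right|_{\beta = \beta_{MAP}}

Sampling \beta from that Gaussian and pushing the draws through the model then
gives an approximate posterior over the predicted CTR.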
