Ted, thank you very much.
Let me check the references.
On Thu, Mar 31, 2011 at 1:53 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>
> On Thu, Mar 31, 2011 at 11:21 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>>
>> Thank you, Ted.
>>
>> (I think MAP is another way to say 'distribution mode' for the entire
>> training set?).
>
> Not quite. That would be maximum likelihood.
> MAP is the distribution mode for the likelihood times the prior (aka the
> posterior distribution). For the uniform prior, these are the same. Of
> course, the uniform prior is often nonsensical mathematically.
>
>>
>> I think I am talking about uncertainty of the result, not the
>> parameters...
>
> Sure. But the posterior of the parameters leads to the posterior of the
> result.
> The real problem here is that you often have strong interactions in the
> parameters that will lead to the same result.
> For instance, if you have one predictor variable repeated in your input, you
> have the worst case of collinearity. The L_1 regularized SGD will be unable
> to pick either variable, but the sum of the weights on the two variables
> will be constant and your predicted value could be perfectly accurate even
> though the parameters each separately appear to be uncertain. The actual
> posterior for the parameter space is a highly correlated distribution. The
> problem here is that the correlation matrix has n^2 elements (though it is
> sparse) which makes the computation of correlation difficult if only because
> overfitting is even worse for n^2 elements than for n.
> Without some decent correlation estimate, you can't get a good error
> estimate on the result.
> So that is why the problem is hard.
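The duplicated-predictor point above is easy to see numerically. Here is a tiny plain-Python sketch (made-up data and learning rate, and plain unregularized SGD rather than Mahout's L_1 learner): the difference of the two weights stays wherever it started, while their sum, and hence the prediction, converges.

```python
import random

random.seed(0)

# y = 3*x + noise, but the single predictor is fed in twice, so the two
# input columns are perfectly collinear.
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in [i / 10 for i in range(100)]]

def sgd(w1, w2, lr=0.01, epochs=200):
    for _ in range(epochs):
        for x, y in data:
            err = (w1 * x + w2 * x) - y   # both "features" are the same x
            # The gradients w.r.t. w1 and w2 are identical, so the
            # difference w1 - w2 never moves from its starting value.
            w1 -= lr * err * x
            w2 -= lr * err * x
    return w1, w2

a1, a2 = sgd(0.0, 0.0)
b1, b2 = sgd(2.0, -2.0)
# Individual weights disagree across runs, but the sums (and predictions) agree.
print(a1, a2, a1 + a2)
print(b1, b2, b1 + b2)
```

Each run predicts essentially the same values even though neither run can say anything certain about w1 or w2 individually; that is the correlated posterior in miniature.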
> Here are a few possible solutions:
> a) online bootstrapping
> One thing that might work is to clone the CrossFoldLearner to make a
> bootstrap learner. Each training example would be passed to a subset of the
> different sublearners at random. The goal would be to approximate
> resampling. Then you would look at the diversity of opinion between the
> different classifiers.
> This has some theoretical difficulties with the relationship between what a
> real resampling would do and what this does. Also, you are giving the
> learners less data, so this isn't going to estimate the error you would get
> with all the data.
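For concreteness, here is a minimal plain-Python sketch of the bootstrap-learner idea (not the CrossFoldLearner API; the data and constants are made up). It uses Poisson(1) replication per sublearner, which is one standard way to approximate resampling in a single online pass:

```python
import random, math

random.seed(1)
K = 32  # number of bootstrap sublearners

# Each sublearner is a tiny online least-squares model y ≈ w*x (SGD).
weights = [0.0] * K

def poisson1():
    # Knuth's method for Poisson(1); in a bootstrap resample of n items,
    # each item's multiplicity is approximately Poisson(1) for large n.
    L, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def train(x, y, lr=0.05):
    for i in range(K):
        for _ in range(poisson1()):   # replicate the example per sublearner
            weights[i] -= lr * (weights[i] * x - y) * x

for _ in range(2000):
    x = random.uniform(0, 1)
    train(x, 2.0 * x + random.gauss(0, 0.2))

preds = sorted(w * 0.5 for w in weights)   # predictions at x = 0.5
lo, hi = preds[1], preds[-2]               # rough spread across the replicas
print(lo, hi)
```

The spread of predictions across the replicas is the "diversity of opinion", with the caveats above about how well this tracks a true resample.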
>>
>> We want to say how confident the regressed value is. Or, sort of, what
>> the variance in the vicinity of X was, on average, in the
>> training set.
>>
>> Say we have a traditional PPC (pay per click) advertising. So we might
>> use a binomial regression to compute a CTR prediction (probability of a
>> click-through).
>
> Sure.
> And, in general, you don't want variance, but instead want to be able to
> sample the posterior. Then you can sample the posterior estimate of regret
> for all of your models and ads and decide which one to pick. This is
> delicate because you need a realistic prior to avoid blue-sky bets all the
> time. There is a major role in this for multilevel models so that you get
> well-founded priors.
>>
>> Then we could just multiply that by the expectation of what a click is
>> worth (obtained through some bidding system) and hence obtain the
>> expectation of a particular ad's payoff in a given situation.
>
> Payoff is only part of the problem, of course, because you really have a
> bandit problem here. You need to model payoff and opportunity cost of
> making the wrong decision now, but also to incorporate the estimated benefit
> that learning about a model might have. Again, strong priors are very
> valuable in this.
>>
>> But there's a difference between saying 'rev(A)=5c ± 0.1c'
>> and 'rev(A)=5c ± 2c', because in the first case we're pretty damn sure
>> B is almost always better than A, and in the second case we just say
>> 'oh, they are both about the same, so just rotate them'.
>
> This is regret: the expectation of opportunity cost,
> C_j = \int_Y \left[ \max(Y) - y_j \right] dP(Y)
> where Y is the vector of payoffs and P(Y) is the multidimensional
> cumulative distribution of same.
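Given posterior samples of the payoff vector Y, that integral is straightforward to estimate by Monte Carlo. A plain-Python sketch, with stand-in Gaussian posteriors (the ads and their parameters are invented for illustration; in practice the draws would come from the model's posterior):

```python
import random

random.seed(2)

# Stand-in posteriors for three ads' payoffs (in cents per impression).
posteriors = {
    "A": lambda: random.gauss(5.0, 0.1),
    "B": lambda: random.gauss(5.0, 2.0),
    "C": lambda: random.gauss(4.0, 0.5),
}

def expected_regret(n=20000):
    # Monte Carlo estimate of C_j = E[max(Y) - y_j] over joint payoff draws.
    totals = {name: 0.0 for name in posteriors}
    for _ in range(n):
        draw = {name: sample() for name, sample in posteriors.items()}
        best = max(draw.values())
        for name, y in draw.items():
            totals[name] += best - y
    return {name: t / n for name, t in totals.items()}

regret = expected_regret()
print(regret)  # the lowest expected regret identifies the safest ad to pick
```

Note that by linearity C_j = E[max(Y)] - E[y_j], so the regrets differ only by the mean payoffs; the posterior spread starts to matter once you ask different questions, e.g. the probability that ad j is the best.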
>>
>> So one way to go about this, I see, is if we have a regression for the mode
>> of the posterior
>> http://latex.codecogs.com/gif.latex?y=\hat{y}\left(\boldsymbol{z},\boldsymbol{\beta}\right)
>> then say we want to estimate the 'variance in the vicinity' by building a
>> regression for another target set composed of squares of errors
>>
>> http://latex.codecogs.com/gif.latex?\left(y-\hat{y}\right)^{2}=\hat{s}\left(\boldsymbol{x},\boldsymbol{\beta}\right)
>>
>> and that would give us much more leverage when comparing ad
>> performance. In other words, we want a better handle on questions like
>> 'how often does ad A perform better than ad B?'
>
> This is Laplace's method for estimating posterior distributions. See
> http://www.inference.phy.cam.ac.uk/mackay/laplace.pdf
> for instance, and his own 1992 paper cited on the first page.
> Mackay's book is excellent on this and related topics. See
> http://www.inference.phy.cam.ac.uk/itprnn/book.html
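In one dimension, Laplace's method amounts to: find the posterior mode by Newton's method and use the curvature there as an inverse variance. A self-contained plain-Python sketch (made-up logistic data, N(0,1) prior; purely illustrative, not MacKay's code):

```python
import math, random

random.seed(4)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up data from a 1-D logistic model with true weight 1.5.
data = [(x, 1 if random.random() < sigmoid(1.5 * x) else 0)
        for x in [random.uniform(-2, 2) for _ in range(500)]]

# Gradient and curvature of the negative log posterior, N(0,1) prior on w.
def grad(w):
    g = w  # from the prior
    for x, y in data:
        g += (sigmoid(w * x) - y) * x
    return g

def curv(w):
    h = 1.0  # from the prior
    for x, y in data:
        p = sigmoid(w * x)
        h += p * (1 - p) * x * x
    return h

# Newton's method finds the MAP estimate (the posterior mode).
w = 0.0
for _ in range(20):
    w -= grad(w) / curv(w)

# Laplace approximation: posterior ≈ N(w_map, 1 / curvature at the mode).
std = 1.0 / math.sqrt(curv(w))
print(w, std)
```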
>
> These methods can be fruitful, but I don't know how to implement them in the
> presence of big data (i.e. in an online learner). With small data, the
> bayesglm package in R may be helpful. See
> http://www.stat.columbia.edu/~gelman/research/unpublished/priors7.pdf
> for more information. I have used bayesglm in smaller data situations with
> very good results.
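For what it's worth, the squared-error regression proposed earlier in the thread is easy to prototype. This plain-Python sketch (made-up heteroscedastic 1-D data, two linear-through-origin SGD fits; the forms are assumptions for illustration) runs a mean model alongside a second model fit to the squared residuals:

```python
import random

random.seed(3)

# Made-up data: y = 2x + noise whose scale grows with x (heteroscedastic).
def draw():
    x = random.uniform(0, 1)
    return x, 2.0 * x + random.gauss(0, 0.1 + 0.5 * x)

w = 0.0   # mean model:      y_hat = w * x
v = 0.0   # variance model:  s_hat = v * x, fit to the squared residuals
for _ in range(50000):
    x, y = draw()
    y_hat = w * x
    w -= 0.05 * (y_hat - y) * x
    sq_err = (y - y_hat) ** 2
    v -= 0.01 * (v * x - sq_err) * x

# The second model reports more 'variance in the vicinity' at large x.
print(w, v * 0.1, v * 0.9)
```

This gives a handle on how spread out the target is near a given x, though it estimates the error surface pointwise rather than yielding a full posterior.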
>
