Hi all,
Dmitriy, I guess you are talking about this paper by Andrea Montanari, am I
correct?
Matrix Completion from Noisy Entries. http://arxiv.org/abs/0906.2027v1
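As I read the criticism quoted below, the problem is filling missing cells with a neutral value instead of leaving them out of the loss. A minimal sketch of the difference, in plain Python; the toy ratings, the rank-1 factors `u` and `v`, and the function names are all made up for illustration and are not taken from the paper or from any library:

```python
# Ratings as a sparse dict: (row, col) -> observed value.
# An absent key means "no measurement", not zero.
observed = {(0, 0): 4.0, (0, 1): 2.0, (1, 0): 2.0}

# Hypothetical rank-1 model r[i][j] = u[i] * v[j]; these factors happen
# to fit every observed cell exactly.
u = [2.0, 1.0]
v = [2.0, 1.0]

def predict(i, j):
    return u[i] * v[j]

def masked_sse(obs):
    """Squared error summed over observed cells only."""
    return sum((val - predict(i, j)) ** 2 for (i, j), val in obs.items())

def zero_filled_sse(obs, n_rows, n_cols):
    """Squared error over the full matrix, with missing cells treated as 0."""
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            target = obs.get((i, j), 0.0)  # made-up target for missing cells
            total += (target - predict(i, j)) ** 2
    return total

print(masked_sse(observed))             # error on real samples
print(zero_filled_sse(observed, 2, 2))  # error including the made-up zero
```

Here the model fits every observed cell exactly, so the masked error is 0.0, while zero-filling charges it 1.0 for the one cell that was never measured.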
2011/2/3 Dmitriy Lyubimov <dlieu.7@gmail.com>
> I am basically retracing a generalization of the Bayesian inference
> problem given in the Yahoo paper. I am too lazy to go back for a quote.
>
> The SVD problem was discussed at meetups; basically, the criticism
> there is that for an RxC matrix, whenever there's a missing measurement,
> one can't specify 'no measurement' but rather has to leave it at some
> neutral value (0? average?), which is essentially nothing but noise,
> since it's not a sample. As one guy from Stanford demonstrated on
> Netflix data, the whole system collapses very quickly once a certain
> threshold of sample sparsity is reached.
>
> On Wed, Feb 2, 2011 at 7:20 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > Dmitriy,
> > I am not entirely clear what you are saying, but as far as I can understand
> > your points, I think I disagree. Of course, if I don't catch your drift, I
> > might be wrong and we might be in agreement.
> >
> > On Wed, Feb 2, 2011 at 2:43 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >>
> >> Both Elkan's work and Yahoo's paper are based on the notion (which is
> >> confirmed by SGD experience) that if we try to substitute missing data with
> >> neutral values, the whole learning falls apart. Sort of.
> >
> > I don't see why you say that. Elkan and Yahoo want to avoid the cold start
> > problem by using user and item offsets and by using latent factors to smooth
> > the recommendation process.
> >
> >>
> >> I.e., if we always know some context A (in this case, static labels and
> >> dyadic ids) and only sometimes some context B, then assuming neutral values
> >> for context B when we are missing this data is invalid, because we are actually
> >> substituting unknown data with made-up data.
> >
> > This is abstract enough that I don't really know what you are referring to. Yes,
> > static characteristics will be used if they are available and latent factors
> > will be used if they are available.
> >
> >>
> >> Which is why SGD produces higher errors than necessary on sparsified label
> >> data. This is also the reason why SVD recommenders produce higher errors
> >> over sparse sample data (I think that's the consensus).
> >
> > I don't think I am part of that consensus.
> > SGD produces very low errors when used with sparse data. But it can also
> > use nonsparse features just as well. What do you mean by "higher errors than
> > necessary"? That lower error rates are possible with latent factor
> > techniques?
> >
> >>
> >> However, thinking in offline-ish mode: if we learn based on samples with A
> >> data, then freeze that learner and train learner B on the error of the frozen
> >> learner for A, using only the input that has context B, then we
> >> are not making the mistake described above. At no point does our learner take any
> >> 'made-up' data.
> >
> > Are you talking about the alternating learning process in Menon and Elkan?
> >
> >>
> >> This whole notion is based on the Bayesian inference process: what can you say
> >> if you only know A, and what correction would you make if you also knew B?
> >
> > ?!??
> > The process is roughly analogous to an EM algorithm, but not very.
> >
> >>
> >> Both papers treat a corner case of this: we have two types of data, A and
> >> B, and we learn A, then freeze learner A, then learn B where available.
> >>
> >> But the general case doesn't have to be just A and B. Actually, that's our case (our
> >> CEO calls it the 'trunk-branch-leaf' case): we always know some context A, and
> >> sometimes B, and sometimes we know all of A, B and some additional
> >> context C.
> >>
> >> So there's a case to be made for generalizing the inference architecture:
> >> specify the hierarchy and then learn A/B/C with SGD+loglinear, or whatever else.
> >
> > I think that these analogies are very strained.
> >
> >
>
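To make the "learn A, freeze, then learn B on the residual" idea from the quoted message concrete, here is a rough sketch. It assumes one numeric feature per context and swaps in closed-form least squares through the origin where the thread talks about SGD; the data and all names are hypothetical:

```python
# Each sample is (a, b, y). Context a is always known; b is None when
# context B was not measured. Toy data, chosen so the fit is exact.
samples = [
    (1.0, None, 2.0),
    (2.0, None, 4.0),
    (1.0, 1.0, 5.0),
    (1.0, -1.0, -1.0),
]

def fit_through_origin(xs, ys):
    """Least-squares slope w for y ~ w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Stage 1: learner A is trained on every sample, using only context a.
w_a = fit_through_origin([a for a, _, _ in samples], [y for _, _, y in samples])

# Stage 2: freeze learner A and train learner B on A's residual error,
# using only the samples where b was actually observed.
with_b = [(a, b, y) for a, b, y in samples if b is not None]
w_b = fit_through_origin(
    [b for _, b, _ in with_b],
    [y - w_a * a for a, _, y in with_b],
)

def predict(a, b=None):
    """Prediction from A alone, plus B's correction when b is known."""
    correction = w_b * b if b is not None else 0.0
    return w_a * a + correction

print(w_a, w_b)
print(predict(3.0), predict(1.0, 1.0))
```

Neither stage is ever trained on a filled-in value for b; when b is unknown, the prediction simply falls back to learner A. A third context C would be one more stage fitted on the residual of A plus B.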
