I am basically retracing a generalization of the Bayesian inference
problem given in the Yahoo paper. I am too lazy to go back for a quote.
The SVD problem was discussed at meetups. Basically, the criticism
is that for an RxC matrix, whenever a measurement is missing,
one can't specify 'no measurement' but has to leave the cell at some
neutral value (0? average?), which is essentially nothing but noise
since it's not a sample. As one guy from Stanford demonstrated on
Netflix data, the whole system collapses very quickly once a certain
threshold of sample sparsity is reached.
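
To make the criticism concrete, here is a minimal sketch (my own toy
setup, not taken from either paper): a small exact rank-2 matrix with a
fixed pattern of observed cells. The names masked_sq_error and
filled_sq_error are hypothetical helpers. A perfect model has zero error
on the observed cells, but is still penalized once the missing cells are
filled with 0 and treated as if they were samples.

```python
import numpy as np

def masked_sq_error(pred, ratings, observed):
    """Squared error summed over observed cells only."""
    diff = (pred - ratings) * observed
    return float(np.sum(diff ** 2))

def filled_sq_error(pred, ratings, observed, fill=0.0):
    """Squared error over every cell, with unobserved cells set to `fill`."""
    filled = np.where(observed, ratings, fill)
    return float(np.sum((pred - filled) ** 2))

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 2))   # hypothetical row factors
V = rng.normal(size=(5, 2))   # hypothetical column factors
truth = U @ V.T               # exact rank-2 "ratings"

# Fixed sparsity pattern: a cell is observed when (row + col) % 3 == 0.
observed = np.add.outer(np.arange(6), np.arange(5)) % 3 == 0

# Score the *true* matrix under both objectives.
masked = masked_sq_error(truth, truth, observed)
filled = filled_sq_error(truth, truth, observed)
print("masked loss of the true model:     ", masked)
print("zero-filled loss of the true model:", filled)
```

The zero-filled objective pulls the factors toward the fill value on
every cell that was never measured, and the share of such cells grows
with sparsity.
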
On Wed, Feb 2, 2011 at 7:20 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Dmitriy,
> I am not clear what you are saying entirely, but as far as I can understand
> your points, I think I disagree. Of course, if I don't catch your drift, I
> might be wrong and we might be in agreement.
>
> On Wed, Feb 2, 2011 at 2:43 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>
>> Both Elkan's work and Yahoo's paper are based on the notion (which is
>> confirmed by SGD experience) that if we try to substitute neutral values
>> for missing data, the whole learning falls apart. Sort of.
>
> I don't see why you say that. Elkan and Yahoo want to avoid the cold start
> process by using user and item offsets and by using latent factors to smooth
> the recommendation process.
>
>>
>> I.e. if we always know some context A (in this case, static labels and
>> dyadic ids) and only sometimes some context B, then assuming neutral values
>> for context B when that data is missing is invalid, because we are actually
>> substituting made-up data for unknown data.
>
> This is so abstract that I don't really know what you are referring to. Yes,
> static characteristics will be used if they are available and latent factors
> will be used if they are available.
>
>>
>> Which is why SGD produces higher errors than necessary on sparsified label
>> data. This is also the reason why SVD recommenders produce higher errors
>> over sparse sample data as well (I think that's the consensus).
>
> I don't think I am part of that consensus.
> SGD produces very low errors when used with sparse data. But it can also
> use nonsparse features just as well. What do you mean by "higher errors than
> necessary"? That lower error rates are possible with latent factor
> techniques?
>
>>
>> However, thinking in offlineish mode, if we first learn from the samples
>> using only the A data, then freeze that learner and train learner B on the
>> error of the frozen learner A, using only the inputs that have context B,
>> then we are not making the mistake described above. At no point does our
>> learner take any 'made-up' data.
>
> Are you talking about the alternating learning process in Menon and Elkan?
>
>>
>> This whole notion is based on a Bayesian inference process: what can you say
>> if you only know A, and what correction would you make if you also knew B?
>
> ?!??
> The process is roughly analogous to an EM algorithm, but not very.
>
>>
>> Both papers handle a corner case of this: we have two types of data, A and
>> B, and we learn A, then freeze learner A, then learn B where available.
>>
>> But the general case doesn't have to be just A and B. Actually that's our
>> case (our CEO calls it the 'trunkbrunchleaf' case): we always know some
>> context A, and sometimes B, and sometimes we also know all of A, B and
>> some additional context C.
>>
>> So there's a case to be made for generalizing the inference architecture:
>> specify the hierarchy and then learn A/B/C with SGD+loglinear, or whatever else.
>
> I think that these analogies are very strained.
>
>
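
For concreteness, a minimal sketch of the staged scheme quoted above,
under my own assumptions rather than either paper's method: plain least
squares stands in for SGD+loglinear, and fit_stage / fit_hierarchy are
hypothetical names. Each level of the hierarchy (A, then B, then C, ...)
is fit only on the rows where its context is present, against the
residual left by the already frozen earlier levels. Missing B values are
stored as NaN and never reach the learner.

```python
import numpy as np

def fit_stage(X, residual):
    """Least-squares weights for one stage (stand-in for any learner)."""
    w, *_ = np.linalg.lstsq(X, residual, rcond=None)
    return w

def fit_hierarchy(stages, y):
    """stages: list of (X, has_context) pairs ordered A, B, C, ...

    Every stage sees only rows where its context exists and is trained
    on the residual of the earlier, frozen stages.
    """
    pred = np.zeros_like(y, dtype=float)
    weights = []
    for X, has in stages:
        w = fit_stage(X[has], (y - pred)[has])
        weights.append(w)
        pred[has] += X[has] @ w   # rows without this context are untouched
    return weights, pred

rng = np.random.default_rng(1)
n = 200
A = rng.normal(size=(n, 2))            # context A: always known
B = rng.normal(size=(n, 1))            # context B: exists for every row...
has_a = np.ones(n, dtype=bool)
has_b = np.arange(n) % 4 == 0          # ...but is observed for 1 row in 4
y = A @ np.array([1.0, -2.0]) + B[:, 0] * 3.0

# Unobserved B cells are NaN: no neutral value is ever substituted.
B_obs = np.where(has_b[:, None], B, np.nan)

weights, pred = fit_hierarchy([(A, has_a), (B_obs, has_b)], y)

pred_a = A @ weights[0]                # what the frozen A learner says alone
mse_a_only = float(np.mean((y - pred_a)[has_b] ** 2))
mse_staged = float(np.mean((y - pred)[has_b] ** 2))
print("MSE on rows with B, A only:", mse_a_only)
print("MSE on rows with B, A + B: ", mse_staged)
```

Adding a context C would just mean appending a third (X, has_context)
pair to the list.
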
