mahout-user mailing list archives

From Dmitriy Lyubimov <>
Subject Re: Recommending on Dynamic Content
Date Thu, 03 Feb 2011 06:08:31 GMT
Yahoo is building what they describe as a 2-stage hierarchical model. I am not
disputing that they use EM etc. to solve the individual stages. I understand
that. Nor am I disputing that they are primarily motivated by solving the
cold-start problem. I understand that as well.

But what they build follows similar reasoning, if not the same, as here, does
it not? It is possible I am mixing things up here, and this hierarchy is not
directly Bayesian, but the motivation here is similar?

I am just saying that we can generalize the problem to hierarchies that
don't have to be 2-stage. That's all.
I am also saying that the practical problem I have at hand is more than
2-stage. I don't know what the best way to solve it would be, but it seems
to me that hierarchical learning analogous to this could be extended to a
more general case with multiple hierarchies on the side info, or even on
user/item content profiles.

For example, say a user & item interact and you always know the time of
day when it happens (just a made-up example), but sometimes (far from
always) you also happen to know the weather, and/or the geo where it
happens. Can't we make use of that information with the addition of
another stage to the hierarchy?
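
A rough sketch of what the prediction side of such a hierarchy could look like
(Python; illustrative names and numbers only, nothing from Mahout): the
always-known stage always contributes, and each extra stage adds a correction
only when its side context happens to be known for that interaction.

# Minimal sketch (illustration only): each stage is a table of learned
# corrections; a stage contributes only when its context is known.
base = {("u1", "i1"): 3.5}                     # stage A: always-known dyad
time_corr = {("u1", "i1", "evening"): +0.5}    # stage B: time of day, always known here
weather_corr = {("u1", "i1", "rain"): -0.25}   # stage C: only sometimes known

def predict(user, item, ctx):
    score = base.get((user, item), 0.0)
    score += time_corr.get((user, item, ctx["time_of_day"]), 0.0)
    if "weather" in ctx:                       # extra stage used only when present
        score += weather_corr.get((user, item, ctx["weather"]), 0.0)
    return score

print(predict("u1", "i1", {"time_of_day": "evening"}))                      # 4.0
print(predict("u1", "i1", {"time_of_day": "evening", "weather": "rain"}))   # 3.75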

On Wed, Feb 2, 2011 at 8:54 PM, Dmitriy Lyubimov <> wrote:
> I am basically retracing the generalization of the Bayesian inference
> problem given in the Yahoo paper. I am too lazy to go back for a quote.
> The SVD problem was discussed at meetups; basically the criticism
> here is that for an RxC matrix, whenever there's a missing measurement,
> one can't specify 'no measurement' but rather has to leave it at some
> neutral value (0? the average?), which is essentially nothing but noise
> since it's not a sample. As one guy from Stanford demonstrated on
> Netflix data, the whole system collapses very quickly after a certain
> threshold of sample sparsity is reached.
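
As a side note, a tiny numeric illustration of that zero-fill criticism (a
made-up low-rank matrix in numpy, not the Netflix study): the sparser the
observations, the more the rank-k SVD of the zero-filled matrix fits the fake
zeros instead of the true values.

# Rough illustration only: fill missing entries of a low-rank matrix with 0,
# take a truncated SVD of the filled matrix, and watch the reconstruction of
# the true values degrade as the observed fraction shrinks.
import numpy as np

rng = np.random.default_rng(0)
R, C, k = 200, 100, 5
truth = rng.normal(size=(R, k)) @ rng.normal(size=(k, C))   # true low-rank matrix

for observed_frac in (0.9, 0.5, 0.1, 0.02):
    mask = rng.random((R, C)) < observed_frac
    filled = np.where(mask, truth, 0.0)             # 'no measurement' becomes a fake 0
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    approx = (U[:, :k] * s[:k]) @ Vt[:k]            # rank-k reconstruction
    rmse = np.sqrt(np.mean((approx - truth) ** 2))  # error against the true values
    print(f"observed {observed_frac:4.0%}: rmse vs truth = {rmse:.2f}")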
> On Wed, Feb 2, 2011 at 7:20 PM, Ted Dunning <> wrote:
>> Dmitriy,
>> I am not clear what you are saying entirely, but as far as I can understand
>> your points, I think I disagree.  Of course, if I don't catch your drift, I
>> might be wrong and we might be in agreement.
>> On Wed, Feb 2, 2011 at 2:43 PM, Dmitriy Lyubimov <> wrote:
>>> Both Elkan's work and Yahoo's paper are based on the notion (which is
>>> confirmed by SGD experience) that if we try to substitute missing data with
>>> neutral values, the whole learning process falls apart. Sort of.
>> I don't see why you say that.  Elkan and Yahoo want to avoid the cold start
>> process by using user and item offsets and by using latent factors to smooth
>> the recommendation process.
>>> I.e., if we always know some context A (in this case, static labels and
>>> dyadic ids) and only sometimes know some context B, then assuming neutral
>>> values for context B when we are missing that data is invalid, because we
>>> are actually substituting made-up data for unknown data.
>> This is so abstract that I don't know what you are referring to, really.  Yes,
>> static characteristics will be used if they are available and latent factors
>> will be used if they are available.
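
For what it's worth, a small illustration of the distinction being argued here
(a hypothetical feature encoding, not something either paper prescribes):
imputing a 'neutral' value for a missing context B manufactures a sample,
while an explicit missing indicator at least tells the learner that B was not
observed at all.

def encode(context_a, context_b=None):
    # always-known context A, plus either the real B or an explicit indicator
    return {
        "a": context_a,
        "b": context_b if context_b is not None else 0.0,    # imputed 'neutral' value
        "b_is_known": 1.0 if context_b is not None else 0.0, # missing indicator
    }

print(encode(1.0, 0.0))   # B really measured as 0.0
print(encode(1.0, None))  # B unknown: same imputed 'b', but the indicator differs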
>>> Which is why SGD produces higher errors than necessary on sparsified label
>>> data. This is also the reason why SVD recommenders produce higher errors
>>> over sparse sample data (I think that's the consensus).
>> I don't think I am part of that consensus.
>> SGD produces very low errors when used with sparse data.  But it can also
>> use non-sparse features just as well.  What do you mean by "higher errors than
>> necessary"?  That lower error rates are possible with latent factor
>> techniques?
>>> However, thinking in offline-ish mode, if we learn based on samples with A
>>> data, then freeze that learner, and then train learner B on the error between
>>> the frozen A learner and only the input that has context B, then we are not
>>> making the mistake described above. At no point does our learner take any
>>> 'made-up' data.
>> Are you talking about the alternating learning process in Menon and Elkan?
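
A minimal sketch of that two-stage freeze-then-correct procedure as described
above (synthetic data and plain least squares in Python, purely for
illustration, not the Menon/Elkan algorithm itself): fit learner A on all
samples, freeze it, then fit learner B on A's residuals using only the rows
where context B was actually observed.

import numpy as np

rng = np.random.default_rng(1)
n = 1000
a = rng.normal(size=n)                        # context A: always known
b = rng.normal(size=n)                        # context B: only sometimes known
has_b = rng.random(n) < 0.3
y = 2.0 * a + 1.5 * b + rng.normal(scale=0.1, size=n)

# stage 1: learner A (least squares on A alone), then freeze it
coef_a = np.sum(a * y) / np.sum(a * a)
resid = y - coef_a * a

# stage 2: learner B fits only the residual, and only where B is known
coef_b = np.sum(b[has_b] * resid[has_b]) / np.sum(b[has_b] ** 2)

def predict(a_i, b_i=None):
    p = coef_a * a_i
    if b_i is not None:                       # correction applied only when B is known
        p += coef_b * b_i
    return p

print(coef_a, coef_b)                         # roughly 2.0 and 1.5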
>>> This whole notion is based on the Bayesian inference process: what can you
>>> say if you only know A, and what correction would you make if you also knew B.
>> ?!??
>> The process is roughly analogous to an EM algorithm, but not very.
>>> Both papers treat a corner case of this: we have two types of data, A and
>>> B, and we learn A, then freeze learner A, then learn B where available.
>>> But the general case doesn't have to be just A and B. Actually that's our
>>> case (our CEO calls it the 'trunk-branch-leaf' case): we always know some
>>> context A, sometimes B, and also sometimes we know all of A, B and some
>>> additional context C.
>>> So there's a case to be made to generalize the inference architecture:
>>> specify the hierarchy and then learn A/B/C with SGD + log-linear, or whatever else.
>> I think that these analogies are very strained.
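
For concreteness, a hedged sketch of the trunk-branch-leaf generalization
described above (a hypothetical structure with toy per-stage learners, nothing
from Mahout or either paper): declare the stage hierarchy, then fit each stage
to the residual of the frozen stages above it, using only the samples that
actually carry that stage's context.

def fit_stage_mean(rows, residuals):
    # toy per-stage learner: just the mean residual over the rows it sees
    return sum(residuals) / len(residuals) if rows else 0.0

def fit_hierarchy(samples, stages):
    # stages: ordered (name, has_context) pairs; earlier = higher in the tree
    models, above = {}, []
    for name, has_ctx in stages:
        rows = [s for s in samples if has_ctx(s)]
        resid = [s["y"] - sum(models[n] for n, hc in above if hc(s)) for s in rows]
        models[name] = fit_stage_mean(rows, resid)    # stages above stay frozen
        above.append((name, has_ctx))
    return models

samples = [{"y": 5.0, "B": True},  {"y": 4.0, "B": True},
           {"y": 3.0, "B": False}, {"y": 3.5, "B": False, "C": True}]
stages = [("A", lambda s: True),                 # trunk: always known
          ("B", lambda s: s.get("B", False)),    # branch: sometimes known
          ("C", lambda s: s.get("C", False))]    # leaf: rarely known
print(fit_hierarchy(samples, stages))            # per-stage offsets for A, B, C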
