mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: Question about spark-itemsimilarity
Date Thu, 15 Dec 2016 02:23:25 GMT
Cross-occurrence allows us to ask the question: are 2 events correlated?

To use the ecom example, purchase is the conversion or primary action. A detail page view
might be related, but we must test each cross-occurrence to make sure. I know for a fact that
with many ecom datasets it is impossible to treat these events as the same thing and get anything
but a drop in quality of recommendations (I’ve tested this). People that use the ALS recommender
in Spark’s MLlib sometimes tell you to weight the view less than the purchase, but this
is nonsense (again, I’ve tested this). What is true is that *some* views lead to purchases
and others do not, so treating them all with the same weight is pure garbage.

What CCO does is find the views that seem to lead to purchase. It can also find category-preferences
that lead to certain purchases, as well as location-preference (triggered by a purchase when
logged in from some location).  And so on. Just about anything you know about users or can
phrase as a possible indicator of user taste can be used to get lift in quality of recommendation.
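
As a rough illustration of what "finding the views that seem to lead to purchase" can look like, here is a minimal Python sketch. It is not Mahout's code. It assumes the correlation test is a log-likelihood ratio (LLR) on a 2x2 table of user counts, and the users, item names, and function names are all made up for the example.

```python
from math import log

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of user counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def cross_occurrence_scores(primary, secondary):
    """primary and secondary map user -> set of ids; returns {(p_id, s_id): llr}."""
    users = set(primary) | set(secondary)
    n = len(users)
    p_ids = set().union(*primary.values())
    s_ids = set().union(*secondary.values())
    scores = {}
    for a in p_ids:
        users_a = {u for u in users if a in primary.get(u, ())}
        for b in s_ids:
            users_b = {u for u in users if b in secondary.get(u, ())}
            k11 = len(users_a & users_b)   # did both
            if k11 == 0:
                continue                   # never cross-occur, nothing to test
            k12 = len(users_a) - k11       # purchased a, did not view b
            k21 = len(users_b) - k11       # viewed b, did not purchase a
            k22 = n - k11 - k12 - k21      # did neither
            scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores

# Hypothetical toy data: everyone views the homepage, only phone buyers
# view the phone review page.
purchases = {"u1": {"phone"}, "u2": {"phone"}, "u3": {"phone"},
             "u4": {"case"}, "u5": {"case"}}
views = {"u1": {"homepage", "phone-review"}, "u2": {"homepage", "phone-review"},
         "u3": {"homepage", "phone-review"}, "u4": {"homepage"},
         "u5": {"homepage"}, "u6": {"homepage"}}

scores = cross_occurrence_scores(purchases, views)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy data the homepage view scores zero against every purchase because everyone does it, while the phone review view scores high against the phone purchase. That is the sense in which only *some* views are kept as indicators.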

So in your example below, purchase history is the conversion action; likes and downloads are
secondary actions, looked at as cross-occurrences. Note that we don’t need to have the same
IDs for all actions. This is why I mention location above.
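
To make the point about IDs concrete, here is a small Python sketch that counts cross-occurrences between purchased items and login locations. Only the user IDs have to line up; the two actions use different ID spaces. The event tuples and the function name are hypothetical, and a real pipeline would still run a correlation test on these raw counts.

```python
from collections import Counter

def cross_occurrence_counts(primary_events, secondary_events):
    """Count users who did primary id p and also secondary id s.

    Both inputs are iterables of (user_id, id) pairs. The ids on the
    secondary side do not have to be item ids at all.
    """
    secondary_by_user = {}
    for user, indicator in secondary_events:
        secondary_by_user.setdefault(user, set()).add(indicator)

    counts = Counter()
    for user, item in set(primary_events):   # set() drops repeated events
        for indicator in secondary_by_user.get(user, ()):
            counts[(item, indicator)] += 1
    return counts

# Hypothetical events: purchases carry item ids, locations carry city ids.
purchases = [("u1", "tent"), ("u2", "tent"), ("u3", "surfboard")]
locations = [("u1", "denver"), ("u2", "denver"),
             ("u3", "san-diego"), ("u4", "denver")]

counts = cross_occurrence_counts(purchases, locations)
for (item, city), n in sorted(counts.items()):
    print(item, city, n)
```

The result is indexed by (item, location) pairs, so each item ends up with a list of locations that co-occur with its purchase rather than a list of other items.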

See this blog post and slide deck for more description of the algo:

BTW to illustrate how powerful this idea is, I have a client that sells one item a year on
average to a customer. It’s a very big item and has a lifetime of one year. So using ALS
you could only train on the purchase and if you were gathering a year of data there would
be precious little training data. Also when you have a user with no purchase it is impossible
to recommend. ALS fails on all users with no purchase history. However with CCO, all the user
journey and any data about the user you can gather along the way can be used to recommend
something to purchase. With ALS, this client would be able to recommend to only 20% of their returning
shoppers, and those recs would be of low quality, based on only one event far in the
past. CCO, using all the clickstream (or important parts of it), can do quite well.
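
A sketch of why a user with no purchases can still get recommendations: assume the model stores, for each item, the ids of each action type that were found to correlate with purchasing it. A user's history for any of those actions can then be matched against the indicators. This is plain Python with invented item names and a simple overlap count as the score, and it is not how any particular engine ranks results.

```python
def recommend(user_history, indicators, top_n=3):
    """Rank items by how many of their indicators appear in the user's history.

    user_history: {action_name: set of ids the user has done}
    indicators:   {item: {action_name: set of ids correlated with buying item}}
    """
    scores = {}
    for item, per_action in indicators.items():
        score = 0
        for action, correlated_ids in per_action.items():
            score += len(correlated_ids & user_history.get(action, set()))
        if score > 0:
            scores[item] = score
    # Highest score first, item name as a deterministic tie-breaker.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

# Hypothetical indicator model for two big-ticket items.
indicators = {
    "boat-a": {"purchase": {"boat-b"},
               "view": {"boat-a-page", "trailer-page"}},
    "boat-b": {"view": {"boat-b-page"}},
}

# A returning shopper with views but no purchase history at all.
new_shopper = {"view": {"trailer-page", "boat-a-page"}}
print(recommend(new_shopper, indicators))

# A purchase-only model would have nothing to match for this user.
print(recommend({"purchase": set()}, indicators))
```

With purchase as the only signal the second call has nothing to work with, which is the situation described above for users with no purchase history.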

This may seem an edge case, but only in degree: every ecom app has data they are throwing away,
and CCO addresses this.

On Dec 13, 2016, at 7:04 AM, Niklas Ekvall <> wrote:

Thanks Pat for that information!

I meant to handle number of clicks or number of downloads and not
rating. But this is not a problem if Spark doesn't handle values; I
have other algorithms that can handle that. However, I am quite curious
about the occurrences, cooccurrences, and cross-occurrences concept.

Can the following be a way to handle different data types?

  - occurrences - purchase history
  - cooccurrences - purchase history/likes
  - cross-occurrences - purchase history/clicks or downloads

Best, Niklas

2016-12-01 18:47 GMT+01:00 Pat Ferrel <>:

> No you can’t, the value is ignored. The algorithm looks at occurrences,
> cooccurrences, and cross-occurrences of several event types not values
> attached to events.
> If you are trying to use rating info, this has been pretty much discarded
> as being not very useful. For instance you may like comedy movies but they
> always get lower ratings than drama (raters bias) so using ratings to
> recommend items is highly problematic, but if a user watched a movie, that
> is a good indicator that they liked it and that is a boolean value. With
> cross-occurrence you can also use dislike as an indicator of preference but
> this is also boolean—a thumbs down.
> To see an end-to-end recommender with all the necessary surrounding
> infrastructure check the Apache-PredictionIO project and the Universal
> Recommender, which uses the code behind spark-itemsimilarity to serve
> recommendations. Read about the UR here: <
> On Nov 30, 2016, at 6:58 AM, Niklas Ekvall <>
> wrote:
> I found that you can, so ignore my question!
> Best regards, Niklas
> 2016-11-30 15:42 GMT+01:00 Niklas Ekvall <>:
>> Hello!
>> I'm using *spark-itemsimilarity* to produce related recommendations and
>> the input data has the form *userID, itemID*. Could I also use the form
>> *userID, itemID, value* (value > 0)? Or does *spark-itemsimilarity* only
>> handle binary values?
>> Best regards, Niklas
