mahout-user mailing list archives

From Niklas Ekvall <niklas.ekv...@gmail.com>
Subject Re: Question about spark-itemsimilarity
Date Sun, 12 Feb 2017 09:48:28 GMT
Thanks Pat!

> Finally, why and when do I want to use the following control option?
>
> Algorithm control options:
>   -mppu <value> | --maxPrefs <value>
>       Max number of preferences to consider per user (optional). Default: 500
>
> This tells spark-itemsimilarity to subsample the data to use only a max of
> 500 events per user. This is so the training time doesn’t increase forever
> with more data, and it has been shown with ecom data that the point of
> diminishing returns is about 500 events per user.

We have a lot of data, and when we run spark-itemsimilarity on an EC2 machine
we get memory issues. Two ways to handle this problem could be to decrease
the number of recommendations and/or to set "Max number of preferences to
consider per user (optional). Default: 500" to something below 500. What
happens if we set it to 250? How will this affect our recommendation
quality?
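For reference, a rough sketch of what that per-user cap does conceptually (a plain-Python approximation; the real spark-itemsimilarity implementation works on distributed Spark data and may sample rather than truncate):

```python
from collections import defaultdict

def cap_events_per_user(events, max_prefs=500):
    """Keep at most max_prefs (userID, itemID) events per user.

    Approximates the subsampling that --maxPrefs implies: beyond the
    cap, extra events for a user are simply dropped.
    """
    per_user = defaultdict(list)
    for user, item in events:
        per_user[user].append(item)
    capped = []
    for user, items in per_user.items():
        # Truncate to the cap; Mahout's code may sample instead of truncating.
        for item in items[:max_prefs]:
            capped.append((user, item))
    return capped

# A heavy user with 600 events and a light user with 1 event.
events = [("u1", f"i{n}") for n in range(600)] + [("u2", "i1")]
capped = cap_events_per_user(events, max_prefs=250)
```

With max_prefs=250, only the heavy user is affected; light users keep all their events, so most of the signal survives and only the long tail of very active users is trimmed.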

Do you have other smart tips to handle our memory problem?

Best regards, Niklas

2017-01-15 22:30 GMT+01:00 Pat Ferrel <pat@occamsmachete.com>:

>
> > On Jan 14, 2017, at 2:41 AM, Niklas Ekvall <niklas.ekvall@gmail.com>
> wrote:
> >
> > Thanks again Pat!
> >
> > I have some other questions that I hope you can help me with:
> >
> >
> > So in your example below purchase history is the conversion action, and
> > likes and downloads are secondary actions looked at as
> > cross-occurrences.
>
> yes
>
> >
> > We want to analyze data from an app, so we have other data types like
> > downloads, likes, recommendations shown, and recommendations ignored, and
> > I guess these actions are quite good to use as secondary actions. Today
> > we feed the algorithms with episodes that the users have consumed; before
> > we do that, we filter out episodes we don't want to recommend. Is it
> > possible to do this type of filtering inside spark-itemsimilarity?
>
> no, any filtering must be done when preparing your data. Also I’d avoid
> sending recs shown and ignored because this sounds like it might cause
> overfitting. The recommender likes to see events that don’t *only* come
> from recommendations. Most apps have many ways to discover items so this is
> not a problem but if you had an app that only showed recommendations, you
> would end up with self-fulfilling recommendations. This problem is called
> “overfitting” in the ML world.
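
A minimal sketch of that data-preparation filtering (the file format, blocklist, and helper name here are hypothetical, just to illustrate dropping excluded episodes before handing the events to spark-itemsimilarity):

```python
# Hypothetical blocklist of episodes we never want to recommend.
excluded_episodes = {"ep42", "ep99"}

def filter_events(lines, excluded):
    """Keep only 'userID,itemID' lines whose item is not in the blocklist."""
    kept = []
    for line in lines:
        user, item = line.strip().split(",")
        if item not in excluded:
            kept.append(f"{user},{item}")
    return kept

raw = ["u1,ep1", "u1,ep42", "u2,ep99", "u2,ep7"]
clean = filter_events(raw, excluded_episodes)
```

In practice this step would run wherever the input files are prepared (a Spark job, a log-processing script, etc.); the point is only that the filtering happens before spark-itemsimilarity sees the data.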
>
> >
> > Finally, why and when do I want to use the following control option?
> >
> > Algorithm control options:
> >  -mppu <value> | --maxPrefs <value>
> >        Max number of preferences to consider per user (optional).
> Default: 500
> >
>
> This tells spark-itemsimilarity to subsample the data to use only a max of
> 500 events per user. This is so the training time doesn’t increase forever
> with more data, and it has been shown with ecom data that the point of
> diminishing returns is about 500 events per user.
>
> > Best regards, Niklas
> >
> >
> > 2016-12-15 3:23 GMT+01:00 Pat Ferrel <pat@occamsmachete.com>:
> >
> >> Cross-occurrence allows us to ask the question: are two events correlated?
> >>
> >> To use the ecom example, purchase is the conversion or primary action; a
> >> detail page view might be related, but we must test each cross-occurrence
> >> to make sure. I know for a fact that with many ecom datasets it is
> >> impossible to treat these events as the same thing and get anything but
> >> a drop in quality of recommendations (I’ve tested this). People who use
> >> the ALS recommender in Spark’s MLlib sometimes tell you to weight the
> >> view less than the purchase. But this is nonsense (again, I’ve tested
> >> this). What is true is that *some* views lead to purchases and others do
> >> not, so treating them all with the same weight is pure garbage.
> >>
> >> What CCO does is find the views that seem to lead to purchase. It can
> >> also find category-preferences that lead to certain purchases, as well
> >> as location-preferences (triggered by a purchase when logged in from
> >> some location). And so on. Just about anything you know about users, or
> >> can phrase as a possible indicator of user taste, can be used to get
> >> lift in the quality of recommendations.
> >>
> >> So in your example below purchase history is the conversion action, and
> >> likes and downloads are secondary actions looked at as cross-occurrences.
> >> Note that we don’t need to have the same IDs for all actions. This is
> >> why I mention location above.
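
As a toy illustration of cooccurrence vs. cross-occurrence (plain pair counting only; CCO actually applies a log-likelihood ratio test to counts like these to keep just the statistically significant pairs):

```python
from collections import defaultdict

def pair_counts(events):
    """Count item pairs that co-occur within the same user's history."""
    by_user = defaultdict(set)
    for user, item in events:
        by_user[user].add(item)
    counts = defaultdict(int)
    for items in by_user.values():
        for a in items:
            for b in items:
                if a != b:
                    counts[(a, b)] += 1
    return counts

def cross_counts(primary, secondary):
    """Count (primary item, secondary item) pairs seen for the same user."""
    prim, sec = defaultdict(set), defaultdict(set)
    for user, item in primary:
        prim[user].add(item)
    for user, item in secondary:
        sec[user].add(item)
    counts = defaultdict(int)
    for user in prim:
        for p in prim[user]:
            for s in sec.get(user, ()):
                counts[(p, s)] += 1
    return counts

purchases = [("u1", "A"), ("u1", "B"), ("u2", "A")]   # primary action
likes = [("u1", "X"), ("u2", "X"), ("u2", "Y")]        # secondary action
co = pair_counts(purchases)              # cooccurrence within purchases
cross = cross_counts(purchases, likes)   # purchase x like cross-occurrence
```

Note the item IDs on the two sides of a cross-occurrence pair come from different spaces (purchased items vs. liked items), which is why the actions don't need to share IDs.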
> >>
> >> See this blog post and slide deck for more description of the algo:
> >> http://actionml.com/blog/cco
> >>
> >>
> >> BTW, to illustrate how powerful this idea is, I have a client that sells
> >> one item a year on average to a customer. It’s a very big item and has a
> >> lifetime of one year. So using ALS you could only train on the purchase,
> >> and if you were gathering a year of data there would be precious little
> >> training data. Also, when you have a user with no purchase it is
> >> impossible to recommend; ALS fails on all users with no purchase
> >> history. However, with CCO, all the user journey and any data about the
> >> user you can gather along the way can be used to recommend something to
> >> purchase. So this client would be able to recommend to only 20% of their
> >> returning shoppers with ALS, and those recs would be of low quality,
> >> based on only one event far in the past. CCO, using all the clickstream
> >> (or important parts of it), can do quite well.
> >>
> >> This may seem an edge case, but only in degree: every ecom app has data
> >> they are throwing away, and CCO addresses this.
> >>
> >> On Dec 13, 2016, at 7:04 AM, Niklas Ekvall <niklas.ekvall@gmail.com>
> >> wrote:
> >>
> >> Thanks Pat for that information!
> >>
> >> I meant to handle the number of clicks or number of downloads, not a
> >> rating. But this is not a problem if spark-itemsimilarity doesn't handle
> >> values; I have other algorithms that can handle that. However, I am
> >> quite curious about the occurrences, cooccurrences, and
> >> cross-occurrences concept.
> >>
> >> Can the following be a way to handle different data types?
> >>
> >>  - occurrences - purchase history
> >>  - cooccurrences - purchase history/likes
> >>  - cross-occurrences - purchase history/clicks or downloads
> >>
> >> Best, Niklas
> >>
> >> 2016-12-01 18:47 GMT+01:00 Pat Ferrel <pat@occamsmachete.com>:
> >>
> >>> No, you can’t; the value is ignored. The algorithm looks at
> >>> occurrences, cooccurrences, and cross-occurrences of several event
> >>> types, not values attached to events.
> >>>
> >>> If you are trying to use rating info, this has been pretty much
> >>> discarded as being not very useful. For instance, you may like comedy
> >>> movies, but they always get lower ratings than drama (rater’s bias), so
> >>> using ratings to recommend items is highly problematic. But if a user
> >>> watched a movie, that is a good indicator that they liked it, and that
> >>> is a boolean value. With cross-occurrence you can also use dislike as
> >>> an indicator of preference, but this is also boolean: a thumbs down.
> >>>
> >>> To see an end-to-end recommender with all the necessary surrounding
> >>> infrastructure, check the Apache PredictionIO project and the Universal
> >>> Recommender, which uses the code behind spark-itemsimilarity to serve
> >>> recommendations. Read about the UR here: http://actionml.com/docs/ur
> >>>
> >>> On Nov 30, 2016, at 6:58 AM, Niklas Ekvall <niklas.ekvall@gmail.com>
> >>> wrote:
> >>>
> >>> I found that you can, so ignore my question!
> >>>
> >>> Best regards, Niklas
> >>>
> >>> 2016-11-30 15:42 GMT+01:00 Niklas Ekvall <niklas.ekvall@gmail.com>:
> >>>
> >>>> Hello!
> >>>>
> >>>> I'm using spark-itemsimilarity to produce related recommendations, and
> >>>> the input data has the form userID, itemID. Could I also use the form
> >>>> userID, itemID, value (value > 0)? Or does spark-itemsimilarity only
> >>>> handle binary values?
> >>>>
> >>>> Best regards, Niklas
> >>>>
> >>>
> >>>
> >>
> >>
> >
>
