mahout-user mailing list archives

From Vishal Santoshi <vishal.santo...@gmail.com>
Subject Re: MinHash/ItemBased
Date Tue, 25 Oct 2011 16:27:55 GMT
Exactly as you said. And, as you may have deduced, the domain I am working
in is very akin to Google's.
MinHash (and thus Jaccard similarity) does scale, as it reduces user-cluster
computation to each user's own data, but it has a different set of issues,
hence PLSI as well as the co-occurrence component (and that pushes us towards
a NoSQL store such as Cassandra, MongoDB or HBase). For me, item-based
recommendation is fairly precise with little or no complexity (apart from
the scale issue) and thus pretty straightforward.
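
To make the MinHash/Jaccard connection concrete, here is a minimal Python
sketch (not Mahout's implementation). It builds a MinHash signature from each
user's set of clicked articles and estimates Jaccard similarity as the
fraction of signature positions that agree. The function names, the SHA-1
based hash family, and the sample article IDs are assumptions made for
illustration only.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under one seeded hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set.

    `items` must be non-empty (a user with no clicks has no signature).
    """
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard as the fraction of matching signature positions."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical click histories for two users.
    user_a = {"article-1", "article-2", "article-3", "article-4"}
    user_b = {"article-3", "article-4", "article-5", "article-6"}
    sig_a = minhash_signature(user_a, num_hashes=256)
    sig_b = minhash_signature(user_b, num_hashes=256)
    print("exact:   ", jaccard(user_a, user_b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

Each signature depends only on that user's own clicks, which is why the
per-user work parallelizes well; users whose signatures (or chunks of them)
match can then be grouped into the same cluster.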

As Sean has predicted, the problem (which both we and Google face) is not
really tailor-made for item-based recommendation.
A hybrid has to be found, IMHO.
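
As one possible reading of the hybrid discussed in this thread
(batch-computed item-item similarities that are reloaded periodically, plus a
cheap real-time item-based step for new or recently active users), here is a
minimal Python sketch. The `recommend` function, the layout of the similarity
table, and the article IDs are hypothetical; in practice the table would come
from a periodic offline Hadoop job, which is not shown here.

```python
from collections import defaultdict


def recommend(recent_clicks, item_similarities, top_n=3):
    """Score unseen items by summing their similarity to recent clicks.

    recent_clicks:     list of item IDs the user clicked recently
    item_similarities: {item: {other_item: similarity}}, assumed to be
                       precomputed offline and possibly slightly stale
    """
    seen = set(recent_clicks)
    scores = defaultdict(float)
    for clicked in recent_clicks:
        for candidate, sim in item_similarities.get(clicked, {}).items():
            if candidate not in seen:
                scores[candidate] += sim
    # Highest score first; ties broken by item ID for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    # Hypothetical, slightly stale similarities from the last batch run.
    similarities = {
        "a": {"b": 0.9, "c": 0.2},
        "d": {"c": 0.5, "b": 0.1},
    }
    # A brand-new user who has clicked two articles in this session.
    print(recommend(["a", "d"], similarities))
```

Because the lookup touches only the rows for the user's recent clicks, it can
serve a new user immediately. An article with no row in the table yet (the
churn problem described in this thread) contributes nothing until the next
batch refresh.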




On Tue, Oct 25, 2011 at 12:16 PM, Sebastian Schelter <ssc@apache.org> wrote:

> The Google News paper you cite follows an approach very different from
> the one implemented in RecommenderJob.
>
> Their approach has a very high complexity and they chose to use it
> because of the extreme item churn in the news domain.
>
> The techniques in the Google paper (MinHash and PLSI) are used to compute
> user similarities, i.e. clusters of users: MinHash just looks at the ratio
> of co-read stories, while PLSI tries to cluster the users according to some
> latent features in their interactions. A third component tracks co-read
> stories in real time, and a user is recommended stories that were co-read
> by other users in his clusters.
>
> --sebastian
>
> On 25.10.2011 18:07, Vishal Santoshi wrote:
> > Yep, please keep me posted.
> > BTW, this is exactly why MinHash piqued my curiosity, and that seems to
> > be affirmed by
> >
> > http://www.datawrangling.com/google-paper-on-parallel-em-algorithm-using-mapreduce
> >
> > MinHash scales, so the offline periodic component (based on
> > Hadoop/Mahout; yes, Mahout has a MinHash-based clustering driver) seems
> > promising.
> > Again, please keep the forum posted on how you go about doing this.
> >
> > Regards,
> >
> > Vishal.
> >
> > On Tue, Oct 25, 2011 at 11:55 AM, Sean Owen <srowen@gmail.com> wrote:
> >
> >> Oh I see, right.
> >>
> >> Well, one general strategy is to use Hadoop to compute the
> >> recommendations regularly, but not nearly in real-time. Then, use the
> >> latest data to imperfectly update the recommendations in real-time.
> >> So, you always have slightly stale recommendations, and item-item
> >> similarities to fall back on, and are reloading those periodically.
> >> Then you're trying to update any recently changed item or user in
> >> real-time using item-based recommendation, which can be fast.
> >>
> >> It's a really big topic in its own right, and there's no complete
> >> answer for you here, but you can piece this together from Mahout
> >> rather than build it from scratch.
> >>
> >> (This is more or less exactly what I have been working on separately,
> >> a hybrid Hadoop-based / real-time recommender that can handle this
> >> scale but also respond reasonably to new data.)
> >>
> >> On Tue, Oct 25, 2011 at 4:44 PM, Vishal Santoshi
> >> <vishal.santoshi@gmail.com> wrote:
> >>> They are all active in a day. I am talking about 8.3 million active
> >>> users a day.
> >>> A significant fraction of them will be new users (say about 2-3
> >>> million of them).
> >>> Further, the churn on items is likely to make historical
> >>> recommendations obsolete.
> >>> Thus if I have recommendations that were good for user A yesterday,
> >>> they are likely to carry far less weight today.
> >>>
> >>> On Tue, Oct 25, 2011 at 11:32 AM, Sean Owen <srowen@gmail.com> wrote:
> >>>
> >>>> On Tue, Oct 25, 2011 at 4:08 PM, Vishal Santoshi
> >>>> <vishal.santoshi@gmail.com> wrote:
> >>>>> In our case the preference is a user clicking on an article (which
> >>>>> doubles as an item).
> >>>>> And these articles are introduced at a frequent rate. Thus the set
> >>>>> of items in the dataset churns very frequently, and new items do
> >>>>> not necessarily have any history.
> >>>>> Of course we need to recommend the latest item.
> >>>>
> >>>> OK, but I'm still not seeing why all users need an update every time.
> >>>> Surely most of the 8.3M users aren't even active in a given day.
> >>>>
> >>>
> >>
> >
>
>
