mahout-dev mailing list archives

From Sebastian Schelter <sebastian.schel...@zalando.de>
Subject Re: Taste - datamodel
Date Tue, 01 Jun 2010 05:56:38 GMT
I'm reading this discussion with great interest.

As you stress the importance of keeping the item-similarity matrix sparse, I
think a useful improvement would be to add an option like
"maxSimilaritiesPerItem" to
o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob, which would make the
job try to cut the number of similar items stored per item down to that limit.

However, since we store each similarity pair only once, a single item could
still end up with more than "maxSimilaritiesPerItem" similar items: we can't
drop such pairs freely, because the other item in the pair might be left with
too few similarities otherwise.

I could add this feature if you agree that it's useful this way.
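
Roughly, the cap would work like this (just a sketch with made-up names, not
the actual job code):

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative only: an (itemID, similarity) pair. */
final class SimilarItem {
  final long itemID;
  final double similarity;
  SimilarItem(long itemID, double similarity) {
    this.itemID = itemID;
    this.similarity = similarity;
  }
}

final class TopSimilarItems {
  /** Keeps only the maxPerItem most similar candidates, dropping the tail. */
  static List<SimilarItem> cap(Iterable<SimilarItem> candidates, int maxPerItem) {
    // Min-heap on similarity: the root is always the weakest retained item,
    // so evicting it when a stronger candidate arrives is cheap.
    PriorityQueue<SimilarItem> heap = new PriorityQueue<>(maxPerItem,
        (a, b) -> Double.compare(a.similarity, b.similarity));
    for (SimilarItem candidate : candidates) {
      if (heap.size() < maxPerItem) {
        heap.add(candidate);
      } else if (candidate.similarity > heap.peek().similarity) {
        heap.poll();
        heap.add(candidate);
      }
    }
    List<SimilarItem> result = new ArrayList<>(heap);
    // Strongest similarities first.
    result.sort((a, b) -> Double.compare(b.similarity, a.similarity));
    return result;
  }
}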

If one wishes to drop similarities below a certain cutoff, this could be
done in a custom implementation of
o.a.m.cf.taste.hadoop.similarity.DistributedItemSimilarity by simply
returning NaN if the computed similarity is below that cutoff value.
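
For illustration, the cutoff could look like this (the class and method here
are placeholders, not the actual interface; the real method to implement is
whatever DistributedItemSimilarity defines):

/** Illustrative sketch: maps sub-cutoff similarities to NaN. */
final class CutoffSimilarity {

  private final double cutoff;

  CutoffSimilarity(double cutoff) {
    this.cutoff = cutoff;
  }

  /** NaN signals "no similarity" to the job, so the pair is discarded. */
  double applyCutoff(double computedSimilarity) {
    return computedSimilarity < cutoff ? Double.NaN : computedSimilarity;
  }
}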

-sebastian

2010/6/1 Ted Dunning <ted.dunning@gmail.com>

> I normally deal with this by purposefully limiting the length of these
> rows. The argument is that if I never recommend more than 100 items to a
> person (or 20 or 1000 ... the argument doesn't change), then none of the
> item -> item* mappings needs to have more than 100 items, since the tail
> of the list can't affect the top 100 recommendations anyway. It is also
> useful to limit the user history to either only recent or only important
> ratings. That means that a typical big multi-get is something like
> 100 history items x 100 related items = 10,000 items x 10 bytes for
> id+score. This sounds kind of big, but the average case is 5x smaller.
>
> On Mon, May 31, 2010 at 4:01 PM, Sean Owen <srowen@gmail.com> wrote:
>
> > I'd be a little concerned about whether this fits comfortably in
> > memory. The similarity matrix is potentially dense -- big rows -- and
> > you're loading one row per item the user has rated. It could get into
> > tens of megabytes for one query. The distributed version dares not do
> > this. But, worth a try in principle.
> >
>
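
To spell out Ted's back-of-envelope above (same numbers from his mail, just
worked through):

public class MultiGetSizeEstimate {
  public static void main(String[] args) {
    int historyItems = 100;        // capped user history
    int similarItemsPerItem = 100; // capped row length
    int bytesPerEntry = 10;        // id + score, per Ted's estimate
    long worstCaseBytes = (long) historyItems * similarItemsPerItem * bytesPerEntry;
    // 100 * 100 * 10 = 100,000 bytes, i.e. roughly 100 KB per query;
    // Ted notes the average case is about 5x smaller, so ~20 KB.
    System.out.println(worstCaseBytes + " bytes worst case");
  }
}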
