mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Item similarity very slow
Date Mon, 22 Jun 2009 20:32:17 GMT
Performance issues like this are often solved in production settings by
dumping the entire database and then processing it with a high-speed batch
system (like Hadoop).  The alternative is to build SQL queries that do all
of the heavy lifting of the cooccurrence counting in one go.
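
As a rough sketch of the second option, the cooccurrence counts can come from
a single self-join on the (item_id, subject_id) table from the original post.
The table name item_subject, the helper names, the sample rows, and the use of
SQLite through Python are all assumptions made for illustration; a production
database would need its own indexes and tuning.

```python
import sqlite3


def cooccurrence_counts(conn):
    """Return {(item_a, item_b): shared_subject_count} for item_a < item_b.

    One self-join counts every pair of items that share a subject, so the
    database is queried once instead of once per item.
    """
    rows = conn.execute(
        """
        SELECT a.item_id, b.item_id, COUNT(*)
        FROM item_subject AS a
        JOIN item_subject AS b
          ON a.subject_id = b.subject_id
         AND a.item_id < b.item_id
        GROUP BY a.item_id, b.item_id
        """
    ).fetchall()
    return {(item_a, item_b): n for item_a, item_b, n in rows}


def demo_connection():
    """Build an in-memory table with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE item_subject ("
        " item_id INTEGER NOT NULL,"
        " subject_id INTEGER NOT NULL,"
        " PRIMARY KEY (item_id, subject_id))"
    )
    conn.executemany(
        "INSERT INTO item_subject VALUES (?, ?)",
        [(1, 10), (1, 11), (2, 10), (2, 11), (2, 12), (3, 12)],
    )
    return conn


if __name__ == "__main__":
    print(cooccurrence_counts(demo_connection()))
```

Pairs that share no subject simply do not appear in the result, which keeps
the output proportional to the number of nonzero cooccurrences.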

Moving data out of a database in dribs and drabs is typically too slow to use.
Most importantly, you wind up munching over the same data many times.  The
fundamental problem is the same for all computations that look like matrix
multiplication ... you need to do n^3 operations and you have n^2 data.  If
you move data for each operation, you are inflating the number of moves by a
factor of n.  To avoid that, you have to do lots of operations (on the order
of n) for each data move.

This applies just as much to sparse arrays ... for matrix sparsity p, where
p n^2 is roughly constant as n grows, the total number of ops in a multiply
is p n^3 and the number of moves is p n^2 (ish), so you pay the same degree
of inflation penalty for bad scheduling.
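
The dump-and-batch approach can be sketched in the same spirit: read every
(item_id, subject_id) row exactly once, then do all of the pair counting in
memory. The function name and the sample rows in the usage note are
hypothetical, and this plain-Python version only stands in for a real batch
job; it is not how Mahout itself computes the Tanimoto coefficient.

```python
from collections import defaultdict
from itertools import combinations


def tanimoto_from_dump(rows):
    """Compute Tanimoto similarity for all item pairs sharing a subject.

    rows is an iterable of (item_id, subject_id) pairs, e.g. a full table
    dump. Each row is touched once while building the in-memory index.
    """
    items_by_subject = defaultdict(set)
    subjects_per_item = defaultdict(int)
    for item_id, subject_id in rows:
        # Ignore duplicate rows so set sizes stay correct.
        if item_id not in items_by_subject[subject_id]:
            items_by_subject[subject_id].add(item_id)
            subjects_per_item[item_id] += 1

    # Count shared subjects per item pair by walking each subject's items.
    shared = defaultdict(int)
    for items in items_by_subject.values():
        for a, b in combinations(sorted(items), 2):
            shared[(a, b)] += 1

    # Tanimoto = |A and B| / |A or B| = shared / (|A| + |B| - shared).
    return {
        (a, b): n / (subjects_per_item[a] + subjects_per_item[b] - n)
        for (a, b), n in shared.items()
    }
```

Calling tanimoto_from_dump([(1, 10), (1, 11), (2, 10), (2, 11), (2, 12),
(3, 12)]) scores items 1 and 2 at 2/3 and items 2 and 3 at 1/3; items 1 and 3
share nothing and are omitted.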

On Mon, Jun 22, 2009 at 1:13 PM, Sean Owen <> wrote:

> First, do you have indexes and constraints set on the table? The
> primary key should be a composite key of these two IDs, and both
> should have an index. Both should be non-null.
> Are you wrapping TanimotoCoefficientSimilarity in a CachingSimilarity
> wrapper? This will at least help it cache the similarity computations.
> It won't help the first run, but will help subsequent runs a lot.
> How many items do you have?
> Let's start here and we can think of more solutions after we deal with
> these questions.
> On Mon, Jun 22, 2009 at 4:06 PM, charlysf<> wrote:
> >
> > Hello,
> >
> > I would like to compute the item similarity for my data.
> >
> > I have this table :
> >
> > item_id, subject_id
> >
> > An item is linked to a subject, which is a Taste, so I would like to have
> > the similarity between items, that is, whether or not they have the same
> > subjects...
> >
> > I tried to implement an AbstractJDBCDataModel for my database, and as I have
> > some boolean relationship between my item and my subject, I compute
> > similarities with TanimotoCoefficientSimilarity.
> >
> > My recommender is GenericItemBasedRecommender and I use a
> > CachingRecommender.
> >
> > In fact, do I have a better solution than :
> >
> > for each item as item1
> >     give me the neighborhood(item1)
> >
> >
> > To retrieve the first neighborhood, I need around 20sec !
> >
> > This is my log :
> >

Ted Dunning, CTO

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
858-414-0013 (m)
408-773-0220 (fax)
