mahout-user mailing list archives

From Sean Owen
Subject Re: Item similarity very slow
Date Mon, 22 Jun 2009 20:36:20 GMT
I think you want an index on article_id too.

The SQL queries make sense since you are using
TanimotoCoefficientSimilarity. It is based entirely on counts of users
that prefer one item, another, or both.
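To make the three counts concrete, here is a minimal sketch of the coefficient computed from them. The function name and the choice to return 0.0 when neither item has any users are assumptions for illustration, not necessarily how TanimotoCoefficientSimilarity handles that case.

```python
def tanimoto(n_a: int, n_b: int, n_both: int) -> float:
    """Tanimoto coefficient from three user counts.

    n_a    -- number of users who prefer item A
    n_b    -- number of users who prefer item B
    n_both -- number of users who prefer both A and B
    """
    # Users who prefer at least one of the two items.
    union = n_a + n_b - n_both
    if union == 0:
        # Assumption: treat "no users at all" as similarity 0.
        return 0.0
    return n_both / union
```

With 3 users on one item, 2 on the other and 1 in common, this gives 1 / (3 + 2 - 1) = 0.25.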

I agree with Ted that using a database in this way hits performance
problems at moderate scale. You may have to do what he suggests, and
what I think you are already doing: pre-compute the similarities. You
can do that with one big SQL query.
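As a rough sketch of what such a query could look like, the example below runs a single self-join against an in-memory SQLite database. The table names `prefs` and `item_similarity` are hypothetical, and it assumes `subject_id` plays the role of the user and that (article_id, subject_id) is unique, as in the quoted table description. Your database's SQL dialect may need small changes.

```python
import sqlite3

# One statement that builds the whole similarity table.
# Only pairs sharing at least one user survive the join,
# so zero similarities are never written.
PRECOMPUTE_SQL = """
CREATE TABLE item_similarity AS
SELECT a.article_id AS item_a,
       b.article_id AS item_b,
       COUNT(*) * 1.0 / (ca.n + cb.n - COUNT(*)) AS similarity
FROM prefs a
JOIN prefs b
  ON a.subject_id = b.subject_id
 AND a.article_id < b.article_id
JOIN (SELECT article_id, COUNT(*) AS n
      FROM prefs GROUP BY article_id) ca
  ON ca.article_id = a.article_id
JOIN (SELECT article_id, COUNT(*) AS n
      FROM prefs GROUP BY article_id) cb
  ON cb.article_id = b.article_id
GROUP BY a.article_id, b.article_id, ca.n, cb.n
"""


def precompute_tanimoto(conn: sqlite3.Connection) -> None:
    """Fill item_similarity with Tanimoto values for co-preferred pairs."""
    conn.execute(PRECOMPUTE_SQL)
    conn.commit()


def demo_connection() -> sqlite3.Connection:
    """Tiny hypothetical data set: rows are (article_id, subject_id)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE prefs (article_id INTEGER NOT NULL, "
        "subject_id INTEGER NOT NULL, UNIQUE (article_id, subject_id))"
    )
    conn.executemany(
        "INSERT INTO prefs VALUES (?, ?)",
        [(1, 10), (1, 11), (2, 10), (3, 12)],
    )
    return conn
```

The recommender can then read similarities from the pre-computed table instead of issuing count queries per pair at request time.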

Those similarity tables do grow as the square of the number of users
or items. You may want to record only the similarities that are
'significant'. For example, if you are using Tanimoto, it doesn't
make sense to store a similarity of 0, which will be very common.
Just assume any missing data point is 0.
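A minimal sketch of that idea, using a plain dictionary as the store. The function names and the threshold parameter are hypothetical, chosen only for illustration.

```python
def keep_significant(pairs, min_similarity=0.0):
    """Drop pairs whose similarity is not above the threshold.

    pairs maps (item_a, item_b) to a similarity. Keys are normalized
    so the smaller item id comes first.
    """
    stored = {}
    for (item_a, item_b), similarity in pairs.items():
        if similarity > min_similarity:
            key = (min(item_a, item_b), max(item_a, item_b))
            stored[key] = similarity
    return stored


def lookup_similarity(stored, item_a, item_b):
    """Return the stored similarity, treating a missing pair as 0."""
    key = (min(item_a, item_b), max(item_a, item_b))
    return stored.get(key, 0.0)
```

Raising `min_similarity` above 0 trades some accuracy for a smaller table.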

On Mon, Jun 22, 2009 at 4:31 PM, charlysf wrote:
> In my table I have :
> 21000 rows, and 10 000 distinct article id
> Article_id and subject_id are not null
> there is a unique index on (article_id, subject_id) because I have an auto
> increment primary key on the table.
> I also have an index on subject_id.
> All indexes are B-tree.
> I use the CachingSimilarity, but in fact, it doesn't work, as I would like
> to compute the similarity only for one item.
> Is it normal that, each time, a new query is done to retrieve "Retrieving
> number of user preferring item in model 25" and "Retrieving number of user
> preferring items in model 25" and to compare with all rows ?
