mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Extracting association rules
Date Wed, 14 Apr 2010 19:30:02 GMT
Yeah, if you squint hard enough, many of these algorithms turn out to
be quite similar, or applicable in similar situations.
They're specializations or recombinations of similar ideas into
different specific problem domains.

Because you've said the words "association rules", something like
FP-Growth sounds more appropriate. But I can describe what
mostSimilarItems() does in case it happens to suit you better.

It just returns the items with highest similarity to a given item,
where 'similarity' is defined by a given ItemSimilarity
implementation. Using an implementation like LogLikelihoodSimilarity,
you could easily discover items which co-occur unusually frequently.
Or with PearsonCorrelationSimilarity you could base the similarity
measure on traditional correlation of ratings -- if you have item
ratings.
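
To make the log-likelihood idea concrete, here is a minimal sketch in plain Python. It is not Mahout's code: the names llr, most_similar_items and users_by_item are made up for illustration, and the data is a toy example. It scores each other item by the log-likelihood ratio of a 2x2 co-occurrence table and returns the highest-scoring ones.

```python
from math import log

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    # Unnormalized entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table:
    k11 = users with both items, k12 = only the first item,
    k21 = only the second item, k22 = neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def most_similar_items(item, users_by_item, num_users, how_many):
    # Hypothetical helper: rank all other items by LLR against `item`.
    a = users_by_item[item]
    scored = []
    for other, b in users_by_item.items():
        if other == item:
            continue
        k11 = len(a & b)
        k12 = len(a) - k11
        k21 = len(b) - k11
        k22 = num_users - k11 - k12 - k21
        scored.append((llr(k11, k12, k21, k22), other))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(other, score) for score, other in scored[:how_many]]

# Toy data (assumed): item -> set of user IDs who have that item.
users_by_item = {
    "bread": {1, 2, 3, 4},
    "butter": {1, 2, 3, 5},
    "milk": {4, 6, 7},
    "beer": {8, 9},
}
print(most_similar_items("bread", users_by_item, num_users=20, how_many=2))
```

Note that the ratio measures how surprising the table is, so a pair that co-occurs unusually rarely can also score high. On real data you would probably want to check that the observed co-occurrence is above what independence predicts.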

You could copy-and-paste this method and modify it to simply discover
the item-item pairs with highest similarity over all pairs. It's very
simple.
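
As a rough illustration of the "highest similarity over all pairs" variant, the sketch below scores every item-item pair and keeps the top few. It is plain Python rather than a modified copy of the Mahout method; top_similar_pairs is a hypothetical name, and Jaccard overlap is only a stand-in for whatever ItemSimilarity you would actually plug in.

```python
from heapq import nlargest
from itertools import combinations

def jaccard(a, b):
    # Stand-in similarity (assumption): overlap of the two user sets.
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def top_similar_pairs(users_by_item, how_many, similarity=jaccard):
    """Score every unordered item pair and keep the `how_many` best.
    This is O(n^2) in the number of items, so it only suits modest catalogs."""
    pairs = combinations(sorted(users_by_item), 2)
    scored = (
        (similarity(users_by_item[x], users_by_item[y]), x, y)
        for x, y in pairs
    )
    return nlargest(how_many, scored)

# Toy data (assumed): item -> set of user IDs who have that item.
users_by_item = {
    "bread": {1, 2, 3, 4},
    "butter": {1, 2, 3, 5},
    "milk": {4, 6, 7},
    "beer": {8, 9},
}
for score, x, y in top_similar_pairs(users_by_item, how_many=2):
    print(x, y, round(score, 3))
```

Using a bounded heap via nlargest means only the best how_many pairs are held in memory, even though every pair is scored.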

The good and bad news about this method is it's not distributed. If
your data is medium-sized -- here my rule of thumb is roughly less
than 100M data points -- I bet it'll suit you fine to run a
non-distributed job based on this bit of code to do your work. If you
need a distributed solution... well you could pick out the map-reduce
phase in org.apache.mahout.cf.taste.hadoop.item which computes
co-occurrence and then write a second job to pick out the highest
co-occurrences. Very simple and quick as map-reduces go.
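
The two-job shape described above can be sketched in-process like this. It is not the code in org.apache.mahout.cf.taste.hadoop.item and it does not use Hadoop; the function names and the toy input are assumptions, and the point is only to show what each phase emits.

```python
from collections import Counter
from heapq import nlargest
from itertools import combinations

def map_user(items):
    # Map step: for one user's items, emit ((item_a, item_b), 1) per pair.
    for a, b in combinations(sorted(set(items)), 2):
        yield (a, b), 1

def reduce_counts(emitted):
    # Reduce step: sum the emitted 1s for each item pair.
    counts = Counter()
    for key, value in emitted:
        counts[key] += value
    return counts

def cooccurrence_job(items_by_user):
    # Job 1: pair co-occurrence counts across all users.
    emitted = (kv for items in items_by_user.values() for kv in map_user(items))
    return reduce_counts(emitted)

def top_pairs_job(counts, how_many):
    # Job 2: keep only the pairs with the highest co-occurrence counts.
    return nlargest(how_many, counts.items(), key=lambda kv: kv[1])

# Toy data (assumed): user -> items that user has.
items_by_user = {
    "u1": ["bread", "butter"],
    "u2": ["bread", "butter", "milk"],
    "u3": ["bread", "butter"],
    "u4": ["beer"],
}
counts = cooccurrence_job(items_by_user)
print(top_pairs_job(counts, how_many=1))
```

In a real distributed run the second phase would typically have each task keep its own local top N and a single reducer merge those, rather than sorting every pair.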


On Wed, Apr 14, 2010 at 3:20 PM, Sebastian Feher <sfeher@crossview.com> wrote:
> Hi All,
>
> I'm looking at extracting association rules with Mahout. If I understand it correctly,
> both GenericItemBasedRecommender.mostSimilarItems() and Parallel FP-Growth seem to provide
> support for doing that. Is this true? If not, what are the major differences between the two
> (including scalability, performance)? Thanks.
>
> Sebastian
