mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: First results from (non-distributed) Apache Mahout / Yahoo KDD Cup
Date Mon, 21 Mar 2011 17:34:18 GMT
Nice, Sean!  I love that a near out of the box implementation puts us in 4th place!

On Mar 20, 2011, at 8:03 PM, Sean Owen wrote:

> I've been test-driving a simple application of Mahout recommenders
> (the non-distributed kind) on Amazon EC2 on the new Yahoo KDD Cup data
> set (kddcup.yahoo.com).
> 
> In the spirit of open source, as I mentioned, I'm committing the
> extra code to mahout-examples that can be used to run a Recommender on
> the input and write the output in the right format. I'd also like to
> publish the rough timings. Find all the source in
> org.apache.mahout.cf.taste.example.kddcup
> 
> 
> Track 1
> 
> m2.2xlarge instance, 34.2GB RAM / 4 cores
> Steady state memory consumption: ~19GB
> Computation time: 30 hours (wall-clock time)
> CPU time per user: ~0.43 sec
> Cost on EC2: $34.20 (!)
> 
> (Helpful hint on cost I realized after the fact: you can almost surely
> get spot instances for cheaper. The maximum price this sort of
> instance has gone for as a spot instance is about $0.60/hour, vs
> "retail price" of $1.14/hour.)
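
[Editor's sketch] The cost comparison in the hint above can be worked through with the figures quoted in the message. This is only a rough upper bound, assuming the spot price sat at its quoted maximum for the whole 30-hour run and the instance was never interrupted; the names below are illustrative.

```python
from decimal import Decimal

# Figures quoted in the message above.
HOURS = Decimal("30")            # Track 1 wall-clock time
RETAIL_RATE = Decimal("1.14")    # USD per hour, "retail price"
SPOT_RATE_MAX = Decimal("0.60")  # USD per hour, highest spot price mentioned


def run_cost(hours: Decimal, rate: Decimal) -> Decimal:
    """Cost of a run billed at a flat hourly rate."""
    return hours * rate


retail = run_cost(HOURS, RETAIL_RATE)
# Assumption: the spot price stays at its quoted maximum for the entire run,
# so this is a ceiling on the spot cost, not an estimate of it.
spot_ceiling = run_cost(HOURS, SPOT_RATE_MAX)
savings = retail - spot_ceiling

print(f"retail: ${retail}, spot ceiling: ${spot_ceiling}, saved at least: ${savings}")
```

Under those assumptions the same run would have cost at most $18.00 instead of $34.20.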
> 
> Resulted in an RMSE of 29.5618 (the rating scale is 0-100), which is
> only good enough for 29th place at the moment. Not terrible for "out
> of the box" performance -- it's just using an item-based recommender
> with uncentered cosine similarity. But not really good in absolute
> terms. A winning solution is going to try to factor in time, and apply
> more sophisticated techniques. The best RMSE so far is about 23.
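
[Editor's sketch] For readers unfamiliar with the two terms above, here is a minimal illustration of uncentered cosine similarity between two items and of RMSE. It is not the Mahout implementation. The rating dictionaries are made-up sample data, and restricting the similarity to users who rated both items is an assumption of this sketch.

```python
import math


def uncentered_cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two items' rating vectors without mean-centering.

    a and b map user id -> rating. Assumption of this sketch: only users
    who rated both items contribute to the result.
    """
    common = a.keys() & b.keys()
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rmse(predicted, actual) -> float:
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical ratings on the 0-100 scale used by the contest.
item_x = {"u1": 90, "u2": 50, "u3": 0}
item_y = {"u1": 90, "u2": 50, "u4": 70}

print(uncentered_cosine(item_x, item_y))  # co-raters u1 and u2 agree exactly
print(rmse([53, 77], [50, 80]))           # each prediction is off by 3
```

Because ratings are not centered, two items that are both rated highly by the same users look similar even if those users rate everything highly, which is one reason the approach is simple but limited.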
> 
> 
> Track 2
> 
> c1.xlarge instance: 7GB RAM / 8 cores
> Steady state memory consumption: ~3.8GB
> Computation time: 4.1 hours (wall-clock time)
> CPU time per user: ~1.1 sec
> Cost on EC2: $3.20
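
[Editor's sketch] The timing figures for the two tracks can be cross-checked against each other. The calculation below assumes every core was fully busy for the whole run and that "CPU time per user" means time on a single core; neither is stated in the message, so treat the result as a rough sanity check only.

```python
def implied_users(wall_hours: float, cores: int, cpu_sec_per_user: float) -> float:
    """Number of users the quoted timings would imply.

    Assumptions of this sketch: all cores are saturated for the full
    wall-clock time, and cpu_sec_per_user is single-core CPU time.
    """
    core_seconds = wall_hours * 3600 * cores
    return core_seconds / cpu_sec_per_user


# Figures taken from the Track 1 and Track 2 summaries in the message.
track1 = implied_users(wall_hours=30, cores=4, cpu_sec_per_user=0.43)
track2 = implied_users(wall_hours=4.1, cores=8, cpu_sec_per_user=1.1)

print(f"Track 1: about {track1:,.0f} users")
print(f"Track 2: about {track2:,.0f} users")
```

If the assumptions hold, the timings correspond to roughly one million users processed for Track 1 and a little over one hundred thousand for Track 2.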
> 
> For this I bothered to write a simplistic item-item similarity metric
> to take into account the additional info that is available: track,
> artist, album, genre. The result was comparatively better: 17.92%
> error rate, good enough for 4th place at the moment.
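
[Editor's sketch] The message does not describe the metric itself, so the following is purely a hypothetical illustration of what a simple item-item similarity over track, artist, album, and genre metadata could look like. The `Track` class, the field names, and the weights are all invented for this example and are not taken from the code in mahout-examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Track:
    """Hypothetical item record; any metadata field may be missing."""
    track_id: int
    artist_id: Optional[int] = None
    album_id: Optional[int] = None
    genre_ids: frozenset = frozenset()


def metadata_similarity(a: Track, b: Track,
                        w_album: float = 0.5,
                        w_artist: float = 0.3,
                        w_genre: float = 0.2) -> float:
    """Weighted overlap of metadata, in [0, 1]. Weights are arbitrary."""
    if a.track_id == b.track_id:
        return 1.0
    score = 0.0
    if a.album_id is not None and a.album_id == b.album_id:
        score += w_album
    if a.artist_id is not None and a.artist_id == b.artist_id:
        score += w_artist
    union = a.genre_ids | b.genre_ids
    if union:
        # Jaccard overlap of the two genre sets.
        score += w_genre * len(a.genre_ids & b.genre_ids) / len(union)
    return score


t1 = Track(1, artist_id=10, album_id=100, genre_ids=frozenset({7, 8}))
t2 = Track(2, artist_id=10, album_id=101, genre_ids=frozenset({8, 9}))
print(metadata_similarity(t1, t2))  # same artist, one of three genres shared
```

A metric of this shape needs no rating data at all, which may be part of why it fits in far less memory than the Track 1 run.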
> 
> 
> Of course, the next task is to put this through the actual distributed
> processing -- that's really the appropriate solution.
> 
> This shows you can still tackle fairly impressive scale with a
> non-distributed solution. These results suggest that the largest
> instances available from EC2 would accommodate almost 1 billion ratings
> in memory. However, at that scale, computing a user's full
> recommendations would easily take seconds, not milliseconds.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

