mahout-dev mailing list archives

From Pat Ferrel <>
Subject Mahout V2
Date Sat, 05 Jul 2014 22:19:02 GMT
I compared spark-itemsimilarity to the Hadoop version on sample data that is 8.7 M, 49290
x 139738, using my little 2-machine cluster, and got the following speedup.

Platform         Elapsed Time
Mahout Hadoop    0:20:37
Mahout Spark     0:02:19
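
As a worked check of the two elapsed times above, the ratio can be computed directly. This is a minimal sketch; the helper name `to_seconds` is illustrative and not part of Mahout.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Elapsed times copied from the table above.
elapsed = {"Mahout Hadoop": "0:20:37", "Mahout Spark": "0:02:19"}

hadoop_s = to_seconds(elapsed["Mahout Hadoop"])
spark_s = to_seconds(elapsed["Mahout Spark"])
speedup = hadoop_s / spark_s

print(f"{hadoop_s} s / {spark_s} s = {speedup:.1f}x")
# prints "1237 s / 139 s = 8.9x"
```

For these two measurements alone the ratio is roughly 8.9x.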

This isn’t quite apples to apples because the Spark version does all the dictionary management,
which is usually two extra jobs tacked on before and after the Hadoop job. I’ve now done the
complete pipeline using both Hadoop and Spark and can say that the Spark version is not only
faster: the old Hadoop way also required keeping track of 10x more intermediate data and
connecting up many more jobs to get the pipeline working. Now it’s just one job. You don’t need
to worry about ID translation anymore and you get over 10x faster completion. This is one of
those times when speed meets ease-of-use.
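
To illustrate what "dictionary management" and "ID translation" refer to here, the sketch below maps external string IDs to contiguous integer indices before a job and translates them back afterwards. It is a generic illustration under assumed names (`build_dictionary`, the sample interactions are made up) and is not Mahout's actual implementation.

```python
def build_dictionary(ids):
    """Assign each distinct external ID a contiguous integer index (first-seen order)."""
    index_of = {}
    for external_id in ids:
        if external_id not in index_of:
            index_of[external_id] = len(index_of)
    return index_of


# Hypothetical (user, item) interactions with application-specific string IDs.
interactions = [("u1", "iphone"), ("u1", "ipad"), ("u2", "nexus"), ("u2", "iphone")]

# "Before" step: translate external IDs into matrix row/column indices.
user_index = build_dictionary(u for u, _ in interactions)
item_index = build_dictionary(i for _, i in interactions)
encoded = [(user_index[u], item_index[i]) for u, i in interactions]

# "After" step: translate integer indices in the output back to external IDs.
item_name = {idx: name for name, idx in item_index.items()}
decoded_items = [item_name[i] for _, i in encoded]

print(encoded)        # [(0, 0), (0, 1), (1, 2), (1, 0)]
print(decoded_items)  # ['iphone', 'ipad', 'nexus', 'iphone']
```

When these two translation steps run as separate jobs around the main computation, the dictionaries become extra intermediate data to keep track of; doing the translation inside one job removes that bookkeeping.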