mahout-user mailing list archives

From Огњен Кубат <>
Subject Clustering of wiki dump.
Date Fri, 06 Jun 2014 13:30:50 GMT
Hi everyone,

I'm interested in clustering a dump of Wikipedia articles (~45 GB of XML) with
kmeans or fkmeans. Can anyone tell me what Hadoop cluster architecture is
required for a job of this size? I have tried clustering on a cluster of 20
quad-core machines with 32 GB of RAM each, but unfortunately I did not
succeed. Is this architecture sufficient? How should I set the memory for
Hadoop's map and reduce tasks? Am I doing something wrong?

This is what my kmeans command looks like:

bin/mahout kmeans -i /out-vectors/tfidf-vectors -o /kmeans/clusters \
  -c /kmeans/initial -xm mapreduce --maxIter 8 --numClusters 9000 \
  --clustering --overwrite \
  -Dmapreduce.reduce.memory.mb=8192 \
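
For reference, these are the Hadoop properties I understand to control per-task
memory (the 8192/6144 values here are just illustrative, not what I have
verified to work):

```xml
<!-- mapred-site.xml: per-task memory limits (illustrative values) -->
<configuration>
  <!-- Container size requested from YARN for each map task, in MB -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>8192</value>
  </property>
  <!-- Container size requested from YARN for each reduce task, in MB -->
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>8192</value>
  </property>
  <!-- JVM heap for the map task; should be somewhat below the container size -->
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx6144m</value>
  </property>
  <!-- JVM heap for the reduce task; likewise below the container size -->
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx6144m</value>
  </property>
</configuration>
```

Is setting these (or passing them with -D on the command line) the right way to
give the tasks more memory for a job of this size?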
