mahout-user mailing list archives

From Огњен Кубат <e10...@gmail.com>
Subject Clustering of wiki dump.
Date Fri, 06 Jun 2014 13:30:50 GMT
Hi everyone,

I'm interested in clustering a Wikipedia article dump (~45 GB of XML) with
kmeans or fkmeans. Can anyone tell me what Hadoop cluster architecture is
required for a job of this size? I have tried to run the clustering on a
cluster of 20 quad-core machines with 32 GB of RAM each, but unfortunately
I did not succeed. Is this architecture sufficient? How should I set the
memory for the map and reduce tasks in Hadoop? Am I doing something wrong?
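
In case it helps, this is my understanding of how the Hadoop 2.x memory
settings relate to each other (please correct me if I have this wrong --
the 75-80% heap-to-container ratio is just a rule of thumb I have seen
suggested):

    mapreduce.map.memory.mb=8192        # YARN container size per map task
    mapreduce.map.java.opts=-Xmx6g      # JVM heap, ~75-80% of the container
    mapreduce.reduce.memory.mb=8192     # container size per reduce task
    mapreduce.reduce.java.opts=-Xmx6g
    yarn.nodemanager.resource.memory-mb # total memory YARN may hand out per node

With 32 GB nodes and 8 GB containers that should leave room for three
containers per node plus headroom for the OS and the DataNode process.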

This is what my kmeans command looks like (I moved the -D options before
the named options, since Hadoop's generic option parser only picks them up
when they come right after the command name):

bin/mahout kmeans \
  -Dmapreduce.map.java.opts=-Xmx6g \
  -Dmapreduce.reduce.java.opts=-Xmx6g \
  -Dmapred.child.java.opts=-Xmx6g \
  -Dmapreduce.map.memory.mb=8192 \
  -Dmapreduce.reduce.memory.mb=8192 \
  -Dmapred.reduce.tasks=160 \
  -i /out-vectors/tfidf-vectors \
  -o /kmeans/clusters \
  -c /kmeans/initial \
  -xm mapreduce --maxIter 8 --numClusters 9000 --clustering --overwrite
