mahout-user mailing list archives

From: Timothy Potter
Subject: Dirichlet clustering woes ...
Date: Thu, 24 Feb 2011 22:18:37 GMT
My colleague Szymon and I have been working on Mahout-588 and hoped to
include Dirichlet in our clustering benchmarks, but unfortunately have not
had much success. So we're reaching out to the community to see if anyone
else has been successful with somewhat large-scale Dirichlet clustering.

Specifically, we have 6,077,604 sparse TF-IDF vectors generated from the
Apache Mail Archives.

Using vectors with 40K dimensions on a 5-node cluster, the job runs nicely
until map 100% and reduce 92%, and then it virtually stops: it takes 3 min to
reach 93%, 7 min to reach 94%, 23 min to reach 95%, 1 h 12 min to reach 96%,
and after another 4 h there is no further progress. The CPUs on the nodes run
at almost 100%, with the full 6GB in use.
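
To put a number on that slowdown, here is a small sketch that computes how much longer each reduce percent took than the one before it. It uses only the timings quoted above; reading "1:12" as 1 hour 12 minutes is an assumption, and the function name is ours, not anything from Mahout or Hadoop.

```python
# Minutes spent reaching each successive reduce percentage (93..96%),
# taken from the timings reported above; "1:12" is read as 1 h 12 min
# (an assumption about the notation).
minutes_per_percent = {93: 3, 94: 7, 95: 23, 96: 72}


def growth_ratios(timings):
    """Return the ratio of each step's duration to the previous step's."""
    durations = [timings[pct] for pct in sorted(timings)]
    return [later / earlier for earlier, later in zip(durations, durations[1:])]


ratios = growth_ratios(minutes_per_percent)
for pct, ratio in zip(sorted(minutes_per_percent)[1:], ratios):
    print(f"{pct}%: {ratio:.2f}x slower than the previous percent")
```

Each percent takes more than twice as long as the previous one, so the reduce phase looks like it is slowing down sharply rather than progressing at a steady rate.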

So then we tried vectors with 20K dimensions and were able to complete 1
iteration after 7 hrs 32 mins. The last 3% of the reduce took about 1 hour
per percent. For this run we had 4 worker nodes (+1 namenode), Xmx2500, and
the maximum number of reducers set to 1.

The job args we're using are:

bin/mahout dirichlet \
    -i /asf-mail-archives/mahout-0.4/tfidf-vectors/ \
    -o /asf-mail-archives/mahout-0.4/dirichlet/ \
    -a0 1.0 \
    -x 10 \
    --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
    -k 60
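
For a sense of scale, a back-of-envelope sketch of what -x 10 implies for the 20K-dimension run. It assumes, purely for illustration, that every iteration costs the same 7 hrs 32 mins as the first one; we have not measured later iterations, and the helper name is hypothetical.

```python
from datetime import timedelta


def estimated_total(first_iteration, iterations):
    """Extrapolate total runtime, assuming every iteration costs the same."""
    return first_iteration * iterations


# Observed duration of the first iteration with 20K-dimension vectors.
first = timedelta(hours=7, minutes=32)

# -x 10 in the job args above requests up to 10 iterations.
total = estimated_total(first, 10)
print(f"Estimated total: {total}")
```

Under that assumption the full job would need a bit over three days, which is why we are looking for ways to speed up the reduce phase.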

We're still studying the code to diagnose ourselves, but also wanted to get
some feedback.

Kind regards,

Timothy Potter
