mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dan Brickley <dan...@danbri.org>
Subject Spectral clustering - a bundle of issues
Date Wed, 07 Sep 2011 12:45:51 GMT
Trying to run https://cwiki.apache.org/MAHOUT/spectral-clustering.html
... seems perhaps some code rot?

Can anyone else report success with Spectral clustering against recent trunk?

Trying bin/mahout spectralkmeans -k 2 -i speccy -o specout --maxIter
10 --dimensions 37

...with the small example affinity file we discussed yesterday, I hit
a series of problems.

data: http://danbri.org/2011/mahout/afftest.txt

1. As I mentioned in comments in
http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/
(both for local pseudo-cluster, and a real one) I had to patch in
calls to job.setJarByClass before job.waitForCompletion. This problem
occured for others elsewhere in Mahout, e.g. MAHOUT-428 and
MAHOUT-197, but I presume it can't be hitting everyone. From grepping
around, this might not be the only component missing setJarByClass
calls. Or is this just me, somehow?

2. Newlines in the input data made it fail, but the associated warning
from AffinityMatrixInputMapper was very vague. I'd suggest allowing
those and #-comments, but maybe not a good idea to make per-component
syntax designs? Suggest also it's worth printing the problem line (see
patch below) when complaining.

3. Failing to load the affinity matrix (surely a requirement for
further progress?) does not seem to halt the job, I see exceptions
mixed in with ongoing processing (until a later problem hits us).
Transcript: https://gist.github.com/1200455 ... actually it wasn't
clear if the newline problem was more of a warning, and other rows
from the input data were accepted. In which case, reporting them as
java.io.IOException seems a bit draconian. So maybe bits of the input
file were in fact loaded. It would be great to clarify what expected
behaviour is.


4. After all that, the job still fails. Full transcript here:
https://gist.github.com/1200428

Excerpt: (I've added a bit more reporting output in a few places)

11/09/07 14:25:06 INFO common.VectorCache: Loading vector from:
specout/calculations/diagonal/part-r-00000
Exception in thread "main" java.util.NoSuchElementException
	at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
	at org.apache.mahout.clustering.spectral.common.VectorCache.load(VectorCache.java:121)

However that file does exist in hdfs, and seqdumper seems to accept
it; it just seems empty:

Input Path: specout/calculations/diagonal/part-r-00000
Key class: class org.apache.hadoop.io.NullWritable Value Class: class
org.apache.mahout.math.VectorWritable
Count: 0

I've posted an informal composite patch at
https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt
 ... if you can confirm the above issues and a breakdown into JIRAs,
I'll attach cleaner patches where appropriate.

Looking forward to getting this running,

cheers,

Dan

Mime
View raw message