mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Spectral clustering - a bundle of issues
Date Wed, 07 Sep 2011 22:48:35 GMT
Cool! Also, DisplaySpectralClustering does not work. It has some problems
with the data directory names. I could not trace where these names are set
using Eclipse.

https://issues.apache.org/jira/browse/MAHOUT-524

On Wed, Sep 7, 2011 at 5:45 AM, Dan Brickley <danbri@danbri.org> wrote:

> Trying to run https://cwiki.apache.org/MAHOUT/spectral-clustering.html
> ... seems perhaps some code rot?
>
> Can anyone else report success with Spectral clustering against recent
> trunk?
>
> Trying bin/mahout spectralkmeans -k 2 -i speccy -o specout --maxIter 10 --dimensions 37
>
> ...with the small example affinity file we discussed yesterday, I hit
> a series of problems.
>
> data: http://danbri.org/2011/mahout/afftest.txt
>
> 1. As I mentioned in comments in
> http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/
> (both for local pseudo-cluster, and a real one) I had to patch in
> calls to job.setJarByClass before job.waitForCompletion. This problem
> occurred for others elsewhere in Mahout, e.g. MAHOUT-428 and
> MAHOUT-197, but I presume it can't be hitting everyone. From grepping
> around, this might not be the only component missing setJarByClass
> calls. Or is this just me, somehow?
>
> 2. Newlines in the input data made it fail, but the associated warning
> from AffinityMatrixInputMapper was very vague. I'd suggest allowing
> those and #-comments, but maybe not a good idea to make per-component
> syntax designs? Suggest also it's worth printing the problem line (see
> patch below) when complaining.
>
> 3. Failing to load the affinity matrix (surely a requirement for
> further progress?) does not seem to halt the job, I see exceptions
> mixed in with ongoing processing (until a later problem hits us).
> Transcript: https://gist.github.com/1200455 ... actually it wasn't
> clear if the newline problem was more of a warning, and other rows
> from the input data were accepted. In which case, reporting them as
> java.io.IOException seems a bit draconian. So maybe bits of the input
> file were in fact loaded. It would be great to clarify what expected
> behaviour is.
>
>
> 4. After all that, the job still fails. Full transcript here:
> https://gist.github.com/1200428
>
> Excerpt: (I've added a bit more reporting output in a few places)
>
> 11/09/07 14:25:06 INFO common.VectorCache: Loading vector from: specout/calculations/diagonal/part-r-00000
> Exception in thread "main" java.util.NoSuchElementException
>        at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>        at org.apache.mahout.clustering.spectral.common.VectorCache.load(VectorCache.java:121)
>
> However that file does exist in hdfs, and seqdumper seems to accept
> it; it just seems empty:
>
> Input Path: specout/calculations/diagonal/part-r-00000
> Key class: class org.apache.hadoop.io.NullWritable Value Class: class org.apache.mahout.math.VectorWritable
> Count: 0
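
The NoSuchElementException in the trace is what an iterator throws when next() is called with nothing left, which fits a file that holds zero records. A minimal sketch of a guard that turns this into a readable error follows. It is plain Java and not the actual VectorCache code; the class name FirstElement and the method firstOrFail are hypothetical, and the real loader's structure is an assumption here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

/** Hypothetical helper (not part of Mahout): fetch the first record or fail clearly. */
final class FirstElement {

  /**
   * Returns the first element of the iterator. If the iterator is empty,
   * throws an exception naming the source instead of a bare NoSuchElementException.
   */
  static <T> T firstOrFail(Iterator<T> records, String source) {
    if (!records.hasNext()) {
      throw new IllegalStateException(
          "No records found in " + source + " (file exists but is empty)");
    }
    return records.next();
  }

  public static void main(String[] args) {
    // Non-empty case: the first element is returned.
    System.out.println(firstOrFail(Arrays.asList("v0", "v1").iterator(), "in-memory list"));

    // Empty case: the message names the file that had no records.
    try {
      firstOrFail(Collections.<String>emptyIterator(),
          "specout/calculations/diagonal/part-r-00000");
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

A guard like this would only improve the message. Why the diagonal file is written with a count of 0 is a separate question, and probably goes back to the affinity matrix not loading.
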
>
> I've posted an informal composite patch at
>
> https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt
>  ... if you can confirm the above issues and a breakdown into JIRAs,
> I'll attach cleaner patches where appropriate.
>
> Looking forward to getting this running,
>
> cheers,
>
> Dan
>
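
For issue 2 in the quoted message, here is a rough sketch of lenient line handling: skip blank lines and #-comments, and include the offending line in the error. It is plain Java, not the real AffinityMatrixInputMapper. The class AffinityLineParser, its Entry type, and the comma-separated "row,col,value" layout are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Mahout): tolerant parsing of "row,col,value" lines. */
final class AffinityLineParser {

  /** One parsed affinity entry. */
  static final class Entry {
    final int row;
    final int col;
    final double value;

    Entry(int row, int col, double value) {
      this.row = row;
      this.col = col;
      this.value = value;
    }
  }

  /** Returns null for blank lines and #-comments; otherwise parses or throws with the line quoted. */
  static Entry parseLine(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty() || trimmed.startsWith("#")) {
      return null; // nothing to parse, and not an error
    }
    String[] parts = trimmed.split(",");
    if (parts.length != 3) {
      throw new IllegalArgumentException("Expected row,col,value but got: '" + line + "'");
    }
    try {
      return new Entry(
          Integer.parseInt(parts[0].trim()),
          Integer.parseInt(parts[1].trim()),
          Double.parseDouble(parts[2].trim()));
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("Unparseable number in line: '" + line + "'", e);
    }
  }

  /** Parses every line, silently dropping the skippable ones. */
  static List<Entry> parseAll(Iterable<String> lines) {
    List<Entry> entries = new ArrayList<Entry>();
    for (String line : lines) {
      Entry entry = parseLine(line);
      if (entry != null) {
        entries.add(entry);
      }
    }
    return entries;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("# affinities", "", "0,1,0.5", "1,0,0.5");
    System.out.println(parseAll(lines).size() + " entries parsed");
  }
}
```

Whether a malformed line should abort the job or only be logged is the open question from issue 3. This sketch throws, so a bad line cannot go unnoticed.
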



-- 
Lance Norskog
goksron@gmail.com
