mahout-user mailing list archives

From Dan Filimon <dangeorge.fili...@gmail.com>
Subject Vectorizing 20 newsgroups
Date Thu, 27 Dec 2012 19:53:09 GMT
Hi!

I'm finally getting back to work on Streaming KMeans! :)
The last thing I did was experiment with different ways of vectorizing
the 20 newsgroups data set, and I wanted to project the vectors into 3D
and check out what I get.

The result is pretty odd, but I get it regardless of the method I use
to generate vectors.
It looks like someone splashed a 2D normal distribution on a sphere.

Here's an image from Ted's algorithm [2] and one from mine [3] using
log term-frequency scoring.
Ted's uses vectors of size 9000 with hashing (using
StaticWordValueEncoder) while mine uses vectors of size ~90000 with a
manual approach.
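
To make the two approaches concrete, here is a minimal sketch of log term-frequency scoring with a hashed, fixed-size vector versus a dictionary-built one. It is plain Python, not Mahout code: `tokenize`, `log_tf`, `hashed_vector`, `build_dictionary` and `dictionary_vector` are hypothetical helper names, the MD5-based index only stands in for whatever StaticWordValueEncoder does internally, and log(1 + count) is an assumed form of the log scoring.

```python
import hashlib
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase, whitespace-split, purely alphabetic tokens only.
    return [t for t in text.lower().split() if t.isalpha()]


def log_tf(count):
    # Assumed log term-frequency weight: log(1 + raw count).
    return math.log(1.0 + count)


def hashed_vector(text, size=9000):
    # Hashing approach: each term maps to a slot in a fixed-size vector.
    # Terms that collide share a slot and their weights add up.
    vec = [0.0] * size
    for term, count in Counter(tokenize(text)).items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        vec[index] += log_tf(count)
    return vec


def build_dictionary(docs):
    # "Manual" approach: one dimension per distinct term in the corpus.
    vocab = sorted({t for d in docs for t in tokenize(d)})
    return {term: i for i, term in enumerate(vocab)}


def dictionary_vector(text, dictionary):
    # Dictionary approach: no collisions, but the size grows with the vocabulary.
    vec = [0.0] * len(dictionary)
    for term, count in Counter(tokenize(text)).items():
        if term in dictionary:
            vec[dictionary[term]] = log_tf(count)
    return vec
```

The hashed vector keeps a fixed size at the cost of collisions, while the dictionary vector has one dimension per distinct term, which is one way to end up with sizes as different as 9000 and ~90000.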

I think the vectorization actually went okay for both algorithms, but
maybe the projection is off?
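
One thing that might be worth checking on the projection side, sketched below under assumptions: this is a generic Gaussian random projection to 3D in plain Python, not necessarily what the code in [1] does, and all function names are hypothetical. If the projected points get rescaled to unit length at any stage, they will all land on the surface of a sphere no matter what the input vectors look like.

```python
import math
import random


def random_projection_matrix(input_dim, output_dim=3, seed=42):
    # One row of Gaussian weights per output dimension (assumed scheme).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
            for _ in range(output_dim)]


def project(vec, matrix):
    # Each output coordinate is the dot product of the input with one row.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def norm(vec):
    # Euclidean length of the vector.
    return math.sqrt(sum(x * x for x in vec))


def normalize(vec):
    # Rescale to unit length; zero vectors are returned unchanged.
    n = norm(vec)
    return [x / n for x in vec] if n > 0 else list(vec)
```

Comparing `norm(project(v, m))` across documents would show whether the lengths really vary; if they are all (close to) 1, something is normalizing the points.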

The shape is odd. What am I doing wrong? :/

[1] https://gist.github.com/4391252
[2] http://swarm.cs.pub.ro/~dfilimon/skm-mahout/ted-projected.png
[3] http://swarm.cs.pub.ro/~dfilimon/skm-mahout/log-projected.png
