So you're clustering 90K-dimensional data?
I'm facing a very similar problem (working on StreamingKMeans for
Mahout), and from what I've read [1], the issue might be that in very
high-dimensional spaces distances become meaningless.
I'm fairly sure this is what's happening here, and I was considering
implementing the test described in the paper (it also seems like a
generally useful algorithm to have).
In any case, since the vectors are so sparse, why not reduce their
dimensionality?
You can try principal component analysis: take the first k singular
vectors from the singular value decomposition of the matrix that has
your vectors as rows. The class that does this is SSVDSolver (there's
also SingularValueDecomposition, but that builds dense matrices, which
might not fit in memory). I've never used it personally, though.
Once you have the first k singular vectors of length n, make them the
rows of a matrix U and multiply each of your vectors x by it (U x) to
get a reduced k-dimensional vector.
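In code, that projection step is just a k x n matrix times an
n-vector. A minimal plain-Python sketch (the U below is a made-up toy
matrix standing in for real singular vectors):

```python
def project(u, x):
    """Multiply a k x n matrix U (rows = top-k singular vectors)
    by an n-vector x, giving a k-dimensional reduced vector."""
    return [sum(u_ij * x_j for u_ij, x_j in zip(row, x)) for row in u]

# Toy example: reduce a 4-dimensional vector to 2 dimensions.
u = [[0.5, 0.5, 0.5, 0.5],    # hypothetical first singular vector
     [0.5, 0.5, -0.5, -0.5]]  # hypothetical second singular vector
x = [1.0, 2.0, 3.0, 4.0]
print(project(u, x))  # -> [5.0, -2.0]
```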
Alternatively, use random projections to reduce the size of the data
set: create a k x n matrix whose entries are sampled from a uniform
(0, 1) distribution (Functions.random in
o.a.m.math.function.Functions), normalize its rows, and multiply each
vector x by it.
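A minimal sketch of that recipe in plain Python (in Mahout you'd fill
a Matrix instead; the names and sizes here are just for illustration).
One caveat: zero-mean entries, e.g. Gaussian, are the more standard
choice for random projections if you want the usual distance-preservation
guarantees; uniform (0, 1) entries add a constant component.

```python
import math
import random

def random_projection_matrix(k, n, seed=0):
    """k x n matrix with uniform (0, 1) entries and rows normalized to
    unit length, as described above. (Zero-mean entries, e.g. Gaussian,
    are the more standard choice for preserving pairwise distances.)"""
    rng = random.Random(seed)
    r = [[rng.random() for _ in range(n)] for _ in range(k)]
    for row in r:
        norm = math.sqrt(sum(v * v for v in row))
        for j in range(n):
            row[j] /= norm
    return r

n, k = 90000, 50
r = random_projection_matrix(k, n)

x = [0.0] * n                     # a sparse "document" vector
x[10] = x[500] = x[40000] = 1.0   # three words present

y = [sum(r_ij * x_j for r_ij, x_j in zip(row, x)) for row in r]  # y = R x
print(len(y))  # 50
```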
Either way, reducing the size of your vectors makes the dimensionality
less of a problem while still giving you a decent approximation (with
SVD you can actually quantify how good it is: the squared error of the
rank-k approximation is the sum of the squared discarded singular
values). From what I've seen, the clusters separate better at smaller
dimensions, but there's the question of how well the reduced data
approximates the original.
See if this helps, I need to do the same thing :)
What do you think?
[1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
<sebastian.briesemeister@unistergmbh.de> wrote:
> The dataset consists of about 4000 documents and is encoded over a
> vocabulary of 90,000 words. However, each document usually contains
> only about 10 to 20 words; only some contain more than 1000 words.
>
> For each document, I set a field in the corresponding vector to 1 if
> the document contains the corresponding word. Then I normalize each
> vector using the L2 norm.
> Finally, I multiply each element (representing a word) in the vector by
> log(#documents/#documents_with_word).
>
> For clustering, I am using cosine similarity.
>
> Regards
> Sebastian
>
> Am 26.03.2013 17:33, schrieb Dan Filimon:
>> Hi,
>>
>> Could you tell us more about the kind of data you're clustering? What
>> distance measure are you using, and what is the dimensionality of the
>> data?
>>
>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
>> <sebastian.briesemeister@unistergmbh.de> wrote:
>>> Dear Mahout users,
>>>
>>> I am facing two problems when clustering instances with Fuzzy c-Means
>>> (cosine distance, random initial clustering):
>>>
>>> 1.) I always end up with one large cluster of rubbish instances. All
>>> of them have a uniform cluster probability distribution and are hence
>>> in the exact middle of the cluster space.
>>> The cosine distance between instances within this cluster ranges from
>>> 0 to 1.
>>>
>>> 2.) Some of my clusters have the same or a very similar center.
>>>
>>> Apart from the problems described above, the clustering seems to work fine.
>>>
>>> Does anybody have an idea how my clustering could be improved?
>>>
>>> Regards
>>> Sebastian
>
