Ted, I'm having issues clustering the data in R. Apparently it wants
to convert it into a dense matrix before clustering.
> kmeans(M, 20, iter.max=20)
Error in asMethod(object) : cannot allocate vector of length 1705007196
There's an as.matrix(...) call that's responsible.
There's the biganalytics package [1], which supports file-backed
matrices, but if I attempt to make my sparse matrix a big.matrix, it
still fails:
> big.matrix(M)
Error in nrow < 1 : cannot allocate vector of length 1705007196
So, I think there's no way I can read it as a sparse Matrix Market
file and run kmeans on it. On the other hand, I could use the
bigkmeans provided by biganalytics, but that doesn't work directly
either:
> bigkmeans(M, 20, iter.max=20)
Error in duplicated.default(centers[[length(centers)]]) :
duplicated() applies only to vectors
So, it seems that I have to read in a big.matrix from disk, but that
would mean building a dense CSV file like I tried earlier, and that
would be over 12GB in size...
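One thing I might try is skipping the CSV entirely: create a
filebacked.big.matrix with the right dimensions and copy the sparse
matrix into it in row chunks, so only one chunk is ever dense in
memory and the 12GB lives in the binary backing file instead. Untested
sketch (assuming M is the sparse matrix I got from readMM, and a
chunk size I'd still have to tune):

```r
library(Matrix)      # sparse matrix M, e.g. from readMM()
library(bigmemory)   # filebacked.big.matrix
library(biganalytics)

# Disk-backed dense matrix; data goes to the backing file, not RAM.
B <- filebacked.big.matrix(nrow(M), ncol(M), type = "double",
                           backingfile = "M.bin",
                           descriptorfile = "M.desc")

chunk <- 10000  # rows densified at a time; tune to available memory
for (start in seq(1, nrow(M), by = chunk)) {
  end <- min(start + chunk - 1, nrow(M))
  B[start:end, ] <- as.matrix(M[start:end, , drop = FALSE])
}

clusters <- bigkmeans(B, 20, iter.max = 20)
```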
Any other ideas?
[1] http://cran.r-project.org/web/packages/biganalytics/biganalytics.pdf
On Tue, Nov 27, 2012 at 9:46 PM, Dan Filimon
<dangeorge.filimon@gmail.com> wrote:
> Running kmeans in R with the projected 50-dimensional vectors gets me
> the following sizes for the 20 clusters:
>
> K-means clustering with 20 clusters of sizes 140, 1195, 228, 3081,
> 2162, 462, 31, 329, 14, 936, 2602, 32, 32, 587, 105, 1662, 2124, 66,
> 78, 2962
>
> I guess projecting them might be the issue... (this is for 50 iterations).
>
> On Tue, Nov 27, 2012 at 4:29 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> "Wrong", in the sense of clustering, is hard to define. Certainly a wide
>> range of cluster sizes looks dubious, but it's not definitive.
>>
>> Next easy steps include cosine-normalizing the vectors and doing
>> semi-supervised clustering. Clustering the 50-d data in R might also be
>> useful. Normalizing is a single method call in the normal flow. It can be
>> done on the projected vectors without loss of generality. After cosine
>> normalization, semi-supervised clustering can be done by adding an
>> additional 20 dimensions with a 1-of-n encoding of the correct newsgroup.
>> In the test data, these can be set to all zeros. This gives the
>> clustering algorithm a strong hint about what you are thinking about.
>>
>> It is also worth checking the sum of squared distances to make sure it is
>> relatively small.
>>
>>> On Tue, Nov 27, 2012 at 5:42 AM, Dan Filimon <dangeorge.filimon@gmail.com> wrote:
>>
>>> They're both wrong! :(
>>>
