Oops, hijacked. Starting a new thread.
On Sun, Mar 13, 2011 at 8:23 PM, Lance Norskog <goksron@gmail.com> wrote:
> What down-projection techniques are available in Mahout, and what
> others would be useful? For example, I'm intrigued by the
> manifold-finders like ISOMAP.
>
> Lance
>
> On Sun, Mar 13, 2011 at 8:18 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> For clustering purposes, you probably don't even need SVD here. You can
>> project randomly down to 100-200 dimensions and do the clustering. You have
>> to use a higher number of dimensions than you would with SVD, but avoiding
>> the SVD is a big win. Depending on the density of your data, this may or
>> may not make clustering faster. The key question is whether the total data
>> size is larger or smaller.
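
For reference, a minimal sketch of the random projection described above,
in plain Python rather than Mahout. The function names, the Gaussian
entries, the 1/sqrt(k) scaling, and the toy sizes are all assumptions made
for illustration; nothing here is a Mahout API.

```python
import random


def random_projection_matrix(n_dims, k, seed=0):
    """Build an n_dims x k matrix of Gaussian entries scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    scale = 1.0 / k ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)]
            for _ in range(n_dims)]


def project(sparse_vec, matrix, k):
    """Project a sparse vector {index: value} down to k dense dimensions."""
    out = [0.0] * k
    for idx, val in sparse_vec.items():
        row = matrix[idx]
        for j in range(k):
            out[j] += val * row[j]
    return out


def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sparse_sq_dist(a, b):
    """Squared Euclidean distance between two sparse {index: value} vectors."""
    keys = set(a) | set(b)
    return sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in keys)


if __name__ == "__main__":
    n_dims, k = 2000, 200  # toy sizes, not the 20,444 columns from the thread
    matrix = random_projection_matrix(n_dims, k)
    doc_a = {3: 1.0, 40: 2.0, 1999: 0.5}  # made-up sparse documents
    doc_b = {3: 0.5, 7: 1.5, 120: 1.0}
    before = sparse_sq_dist(doc_a, doc_b)
    after = sq_dist(project(doc_a, matrix, k), project(doc_b, matrix, k))
    print("squared distance before:", before, "after:", after)
```

The projected vectors are dense, so each document costs k numbers no matter
how sparse it was to begin with. That is the data size trade-off mentioned
above.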
>>
>> Also, since your data is essentially count data, you have large amounts of
>> noise which probably makes everything after about 20-30 singular vectors
>> into random noise anyway. As such, I recommend replacing the later singular
>> vectors with random numbers. These will be quasi-orthogonal and thus pretty
>> much as good as real singular vectors for reducing dimensionality, though
>> not quite as good as a minimal basis.
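
A quick way to see the quasi-orthogonality claim: random directions in a
space this wide are almost perpendicular to each other. The sketch below is
plain Python with hypothetical helper names. It draws a few random unit
vectors and reports the largest absolute cosine between any pair.

```python
import math
import random


def random_unit_vector(n_dims, rng):
    """Draw a Gaussian vector and scale it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(vectors):
    """Largest |cosine| over all pairs of unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    n_dims = 20444  # column count quoted in the thread
    vectors = [random_unit_vector(n_dims, rng) for _ in range(10)]
    print("largest |cosine| between any pair:", max_abs_cosine(vectors))
```

The typical cosine between two such vectors is on the order of
1/sqrt(n_dims), so the pairs are close to orthogonal without any explicit
orthogonalization step.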
>>
>> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <thelabdude@gmail.com> wrote:
>>
>>> Looking for a little clarification on using SVD to reduce the dimensions
>>> of my vectors for clustering ...
>>>
>>> Using the ASF mail archives for MAHOUT-588, I have 6,076,937 tf-idf vectors
>>> with 20,444 dimensions. I successfully ran Mahout SVD on the vectors using:
>>>
>>> bin/mahout svd -i /asfmailarchives/mahout0.4/sparse1gramstem/tfidfvectors \
>>> -o /asfmailarchives/mahout0.4/svd \
>>> --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>>
>>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why only
>>> 87, but I'm assuming that has something to do with Lanczos???
>>>
>>> So then I proceeded to transpose the SVD output using:
>>>
>>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
>>> --numRows 87
>>>
>>> Next, I tried to run transpose on my original vectors using:
>>>
>>> transpose -i /asfmailarchives/mahout0.4/sparse1gramstem/tfidfvectors
>>> --numCols 20444 --numRows 6076937
>>>
>>> This failed with error:
>>>
>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>> at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:100)
>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
>>> at org.apache.hadoop.mapred.Child.main(Child.java:170)
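
The cast error suggests the tf-idf vectors are keyed by Text (presumably
document ids) while the transpose mapper expects IntWritable row numbers.
If that reading is right, the rows need integer keys before they can be
treated as a matrix. The sketch below shows the idea in plain Python. It is
not Mahout code, and the document keys in it are made up. Mahout's rowid
job appears to do this kind of re-keying, but check that against your
version.

```python
def assign_row_ids(keyed_vectors):
    """Replace text keys with sequential integer row ids.

    keyed_vectors: iterable of (text_key, vector) pairs.
    Returns (matrix_rows, doc_index), where matrix_rows maps row id to
    vector and doc_index maps row id back to the original text key.
    """
    matrix_rows = {}
    doc_index = {}
    for row_id, (text_key, vector) in enumerate(keyed_vectors):
        matrix_rows[row_id] = vector
        doc_index[row_id] = text_key
    return matrix_rows, doc_index


if __name__ == "__main__":
    # Hypothetical document keys and sparse tf-idf vectors.
    docs = [
        ("msg-0001", {12: 0.4, 907: 1.3}),
        ("msg-0002", {3: 2.1}),
    ]
    rows, index = assign_row_ids(docs)
    print(rows)
    print(index)
```

Keeping the id-to-key index around matters later: cluster assignments come
back as row numbers and have to be mapped to the original messages.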
>>>
>>> So I think I'm missing something ... I'm basing my process on the steps
>>> outlined in thread:
>>>
>>> http://lucene.472066.n3.nabble.com/UsingSVDwithCanopyKMeanstd1407217.html
>>> i.e.
>>>
>>> bin/mahout svd (original -> svdOut)
>>> bin/mahout cleansvd ...
>>> bin/mahout transpose svdOut -> svdT
>>> bin/mahout transpose original -> originalT
>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>> bin/mahout kmeans newMatrix
>>>
>>> Based on Ted's last comment in that thread, it seems like I may not need to
>>> transpose the original matrix? Just want to be sure this process is
>>> correct.
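
On the shapes alone: the original matrix is 6,076,937 x 20,444 and the
cleaned eigenvectors are 87 x 20,444, so the reduced matrix should be
original times the transposed eigenvectors, giving 6,076,937 x 87. A tiny
sketch of that arithmetic in plain Python, with toy numbers and
hypothetical helper names:

```python
def transpose(m):
    """Transpose a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain dense matrix product a (r x n) times b (n x c)."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


if __name__ == "__main__":
    # 4 documents x 3 terms stands in for 6,076,937 x 20,444.
    original = [[1, 0, 2], [0, 1, 0], [3, 0, 0], [0, 0, 1]]
    # 2 eigenvectors x 3 terms stands in for 87 x 20,444.
    eigenvectors = [[1, 0, 0], [0, 0, 1]]
    reduced = matmul(original, transpose(eigenvectors))
    print(len(reduced), "x", len(reduced[0]))
```

Whether the original also has to be transposed first depends on how the
matrixmult job defines its product, which this sketch does not settle.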
>>>
>>
>
>
>
> 
> Lance Norskog
> goksron@gmail.com
>

Lance Norskog
goksron@gmail.com
