mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Need a little help with using SVD
Date Mon, 14 Mar 2011 03:28:18 GMT
Oops, hijacked. Starting a new thread.

On Sun, Mar 13, 2011 at 8:23 PM, Lance Norskog <goksron@gmail.com> wrote:
> What down-projection techniques are available in Mahout, and what
> others would be useful? For example, I'm intrigued by the
> manifold-finders like ISOMAP.
>
> Lance
>
> On Sun, Mar 13, 2011 at 8:18 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> For clustering purposes, you probably don't even need SVD here.  You can
>> project randomly down to 100-200 dimensions and do the clustering.  You have
>> to use a higher number of dimensions than you would with SVD, but avoiding
>> the SVD is a big win.  Depending on the density of your data, this may or
>> may not make clustering faster.  The key question is whether the total data
>> size after projection is larger or smaller than before.
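
A minimal sketch of the random projection described above, on toy data rather than the real tfidf vectors. The Gaussian projection matrix, the 1/sqrt(k) scaling, and all names and sizes here are illustrative assumptions, not something Mahout provides under these names.

```python
import math
import random

def random_projection_matrix(n_dims, k, seed=0):
    """Build an n_dims x k matrix of Gaussian entries scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)]
            for _ in range(n_dims)]

def project(sparse_vec, matrix):
    """Project a sparse vector ({index: value}) down to k dense dimensions."""
    k = len(matrix[0])
    out = [0.0] * k
    for idx, val in sparse_vec.items():
        row = matrix[idx]
        for j in range(k):
            out[j] += val * row[j]
    return out

def sparse_distance(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in keys))

def dense_distance(a, b):
    """Euclidean distance between two dense vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy data (assumed sizes): two sparse "documents" in a 2,000-term space.
rng = random.Random(42)
n_dims, k = 2000, 150
doc_a = {rng.randrange(n_dims): rng.random() for _ in range(40)}
doc_b = {rng.randrange(n_dims): rng.random() for _ in range(40)}

R = random_projection_matrix(n_dims, k)
ratio = (dense_distance(project(doc_a, R), project(doc_b, R))
         / sparse_distance(doc_a, doc_b))
print(f"projected / original distance: {ratio:.3f}")
print(f"non-zeros before: {len(doc_a)}, values after: {k}")
```

The distance ratio should land close to 1, which is what makes clustering in the projected space reasonable. In this toy case each document has at most 40 non-zero terms but 150 values after projection, so the data actually grows, which is the density caveat raised above.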
>>
>> Also, since your data is essentially count data, you have large amounts of
>> noise, which probably turns everything after about 20-30 singular vectors
>> into random noise.  As such, I recommend replacing the later singular
>> vectors with random vectors.  These will be quasi-orthogonal and thus pretty
>> much as good as real singular vectors for reducing dimensionality, though
>> not quite as good as a minimal basis.
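
To illustrate the quasi-orthogonality claim, here is a small sketch that draws random unit vectors in a high-dimensional space and reports the largest pairwise cosine. The dimension (5,000), the number of vectors (20), and the function names are arbitrary choices for illustration.

```python
import math
import random

def random_unit_vector(n_dims, rng):
    """Draw a Gaussian random vector and normalize it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_cosine(vectors):
    """Largest |cosine| over all pairs; inputs are assumed unit length."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst

rng = random.Random(7)
n_dims = 5000
vectors = [random_unit_vector(n_dims, rng) for _ in range(20)]
worst = max_abs_cosine(vectors)
print(f"largest |cosine| among 20 random vectors: {worst:.4f}")
```

Exactly orthogonal vectors would give 0. Random vectors in this many dimensions stay close to that, with typical cosines on the order of 1/sqrt(n_dims).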
>>
>> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <thelabdude@gmail.com> wrote:
>>
>>> Looking for a little clarification with using SVD to reduce dimensions of
>>> my vectors for clustering ...
>>>
>>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
>>> with 20,444 dimensions. I successfully ran Mahout SVD on the vectors using:
>>>
>>> bin/mahout svd \
>>>    -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>>    -o /asf-mail-archives/mahout-0.4/svd \
>>>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>>
>>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why only
>>> 87, but I'm assuming that has something to do with Lanczos???
>>>
>>> So then I proceeded to transpose the SVD output using:
>>>
>>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors \
>>>    --numCols 20444 --numRows 87
>>>
>>> Next, I tried to run transpose on my original vectors using:
>>>
>>> bin/mahout transpose \
>>>    -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>>    --numCols 20444 --numRows 6076937
>>>
>>> This failed with error:
>>>
>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>>        at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:100)
>>>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
>>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> So I think I'm missing something ... I'm basing my process on the steps
>>> outlined in this thread:
>>>
>>> http://lucene.472066.n3.nabble.com/Using-SVD-with-Canopy-KMeans-td1407217.html
>>>
>>> i.e.
>>>
>>> bin/mahout svd (original -> svdOut)
>>> bin/mahout cleansvd ...
>>> bin/mahout transpose svdOut -> svdT
>>> bin/mahout transpose original -> originalT
>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>> bin/mahout kmeans newMatrix
>>>
>>> Based on Ted's last comment in that thread, it seems like I may not need to
>>> transpose the original matrix? Just want to be sure this process is
>>> correct.
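
For reference, the shape arithmetic behind the projection step can be sketched on a tiny example. This is plain matrix algebra on made-up numbers, with the eigenvectors stored as rows as in the 87 x 20,444 output above. It does not model how Mahout's matrixmult job treats its inputs, so whether the original needs transposing still depends on that job's multiplication convention.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product; columns of a must match rows of b."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt]
            for row in a]

# Toy stand-ins: 4 documents x 3 terms, and 2 eigenvectors of length 3.
docs = [
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 4.0],
]
eigenvectors = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

# (4 x 3) times (3 x 2) gives one 2-dimensional row per document.
reduced = matmul(docs, transpose(eigenvectors))
print(len(reduced), "x", len(reduced[0]))
```

With the real data the same product would be (6,076,937 x 20,444) times (20,444 x 87), giving one 87-dimensional vector per document to feed into kmeans.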
>>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>



-- 
Lance Norskog
goksron@gmail.com
