mahout-user mailing list archives

From P Kal <ruvi...@gmail.com>
Subject Re: Kmeans - clustering help
Date Sat, 07 Sep 2013 17:15:18 GMT
It seems that I've had the wrong idea the entire time. Thanks for the help.


On Fri, Sep 6, 2013 at 3:45 PM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:

> seq2sparse uses Lucene's standard tokenization to generate the tf-idf
> vectors. But since your data is in CSV format (from the example you
> provided below), you should be using Mahout's CSVVectorIterator to generate
> the vectors.
>
> See
> http://stackoverflow.com/questions/13663567/mahout-csv-to-vector-and-running-the-program
>
> Once you have generated the vectors, you also need to pass the -cl
> option to the kmeans CLI to generate the clusters.
> Also, you don't have to generate the centroids up front (unless that is
> something specific to your use case); kmeans will generate k random
> centroids during execution.
>
>
>
>
>
> ________________________________
>  From: P Kal <ruvikal@gmail.com>
> To: user@mahout.apache.org
> Sent: Friday, September 6, 2013 2:05 PM
> Subject: Kmeans - clustering help
>
>
> I'm trying to run kmeans clustering on numeric-only data.
>
> This is how my data looks:
> header1, header2, header3, header4, header5
> 0,0,0,0,0
> 1,3,2,4,5
> 3,2,4,5,6
> .
> .
> .
>
> There are about 3000 rows.
>
> For the cluster centroids I created another file:
> (0,0,0,0,0)
> (1,2,3,4,5)
>
> My understanding is that we have to convert these text files to sequence
> files and then generate sparse vectors from those sequence files for kmeans
> clustering.
>
> I've used seqdirectory followed by seq2sparse,
> and at the end I have two folders, one for input and one for centroids.
>
> The input folder has the dirs generated by seq2sparse on the input sequence
> file. Similarly, the centroids folder has the dirs generated by seq2sparse
> on the centroids sequence file.
> This is the command I use to run kmeans:
>
> mahout kmeans --input input/tfidf-vectors --output output -c
> centroids/tfidf-vectors --maxIter 20
> and I get this error:
>
> No input clusters found in centroids/tfidf-vectors Check your -c argument.
>
> The sequence files have data, but the files generated by seq2sparse do not
> have any contents.
> Can someone please help?
>
> BTW, all of this is on HDFS, not in local mode.
>
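
As a rough illustration of the advice in this thread (treat each CSV row as a numeric vector rather than as text to tokenize, and let k-means choose its own starting centroids), here is a minimal Python sketch. It is not Mahout code and does not use CSVVectorIterator or the kmeans CLI. The function names parse_rows and kmeans, the sample rows, and the choice to sample k input rows as the initial centroids are all assumptions made for illustration only.

```python
import csv
import io
import random


def parse_rows(text):
    """Parse CSV text with one header line into a list of float vectors."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    return [[float(x) for x in row] for row in reader if row]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=20, seed=0):
    """Lloyd's algorithm; starting centroids are k rows sampled from the data
    (an assumption for this sketch, not a statement about Mahout internals)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical sample mirroring the three rows quoted in the thread.
SAMPLE = """header1,header2,header3,header4,header5
0,0,0,0,0
1,3,2,4,5
3,2,4,5,6
"""

if __name__ == "__main__":
    points = parse_rows(SAMPLE)
    centroids, assignments = kmeans(points, k=2)
    print(centroids)
    print(assignments)
```

Each row becomes one five-dimensional vector of floats, so no tokenization or tf-idf weighting is involved, and no separate centroids file is needed because the starting centroids come from the data itself.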
