mahout-user mailing list archives

From Rajesh Nikam <rajeshni...@gmail.com>
Subject Re: bottom up clustering
Date Mon, 03 Jun 2013 15:51:57 GMT
I have 1500 points, and I am computing -km as k * log(n).
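(With k = 4 and n = 1500 that comes to 4 * log10(1500) ≈ 12.7, matching the -km 12 in the run quoted below, so the log here is presumably base 10; a natural log would give about 29.)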
On Jun 3, 2013 8:53 PM, "Suneel Marthi" <suneel_marthi@yahoo.com> wrote:

> How many datapoints do you have in your input? How are you computing the
> value of -km?
>
>
>
>
> ________________________________
>  From: Rajesh Nikam <rajeshnikam@gmail.com>
> To: Suneel Marthi <suneel_marthi@yahoo.com>
> Cc: "user@mahout.apache.org" <user@mahout.apache.org>; Ted Dunning <
> ted.dunning@gmail.com>
> Sent: Monday, June 3, 2013 9:55 AM
> Subject: Re: bottom up clustering
>
>
> I tried the commands below:
>
> hadoop jar mahout-examples-0.8-SNAPSHOT-job.jar
> org.apache.mahout.utils.vectors.arff.Driver --input
> /mnt/cluster/t/input-set.arff --output /user/hadoop/t/input-set-vector/
> --dictOut /mnt/cluster/t/input-set-dict
>
> hadoop jar mahout-core-0.8-SNAPSHOT-job.jar
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
>     -i /user/hadoop/t/input-set-vector \
>     -o /user/hadoop/t/skmeans \
>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
>   -k 4 \
>   -km 12 \
>   -testp 0.3 \
>   -mi 10 \
>   -ow
>
> and dumped the output with seqdumper:
>
> hadoop jar mahout-examples-0.8-SNAPSHOT-job.jar
> org.apache.mahout.utils.SequenceFileDumper -i
> /user/hadoop/t/skmeans/part-r-00000 -o
> /mnt/cluster/t/skmeans-cluster-points.txt
>
> The dump contains the cluster centroids.
>
> ==>>
>
> This was a small test set for which I could guess the number of clusters.
> Since streaming kmeans requires -k to be specified, how do I do the same when
> the sample set is big?
>
> It also gives an error like the following when k was specified as 40 for streamingkmeans:
>
> -k 40 \
> -km 190 \
>
> java.lang.IllegalArgumentException: Must have more datapoints [4] than
> clusters [40]
>         at
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
>
> ==>>
>
> How do I use these centroids for clustering? I do not understand their use.
>
> Thanks,
> Rajesh
>
> On Mon, Jun 3, 2013 at 6:19 PM, Suneel Marthi <suneel_marthi@yahoo.com
> >wrote:
>
> > You should be able to feed arff vectors to streaming kmeans (I have not
> > tried that myself; I never had to work with arff).
> > I used tfidf-vectors as an example; you should be good with arff.
> >
> > Give it a try and let us know.
> >
> >
> >   ------------------------------
> >  *From:* Rajesh Nikam <rajeshnikam@gmail.com>
> > *To:* "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <
> > suneel_marthi@yahoo.com>
> > *Cc:* Ted Dunning <ted.dunning@gmail.com>
> > *Sent:* Monday, June 3, 2013 4:30 AM
> >
> > *Subject:* Re: bottom up clustering
> >
> > Hi Suneel,
> >
> > I have used seqdirectory followed by seq2sparse on the 20newsgroups set.
> >
> > Then I used the following command to run streamingkmeans to get 40 clusters:
> >
> > hadoop jar mahout-core-0.8-SNAPSHOT-job.jar
> > org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
> >     -i /user/hadoop/news-vectors/tf-vectors/ \
> >     -o /user/hadoop/news-stream-kmeans \
> >   -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> >   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
> >   -k 40 \
> >   -km 190 \
> >   -testp 0.3 \
> >   -mi 10 \
> >   -ow
> >
> > I dumped the output using seqdumper from
> > /user/hadoop/news-stream-kmeans/part-r-00000.
> >
> > In the dumped file the centroids look like this:
> >
> > Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> > org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> > Key: 0: Value: key = 0, weight = 1.00, vector =
> >
> {1421:1.0,2581:1.0,5911:1.0,7854:3.0,7855:3.0,10022:2.0,11141:1.0,11188:1.0,11533:1.0,
> > Key: 1: Value: key = 1, weight = 3.00, vector =
> >
> {1297:1.0,1421:0.0,1499:1.0,2581:0.0,5899:1.0,5911:0.0,6322:2.0,6741:1.0,6869:1.0,7854
> > Key: 2: Value: key = 2, weight = 105.00, vector =
> >
> {794:0.09090909090909091,835:0.045454545454545456,1120:0.045454545454545456,1297:0.0
> > Key: 3: Value: key = 28, weight = 259.00, vector =
> >
> {1:0.030303030303030297,8:0.0101010101010101,12:0.0202020202020202,18:0.02020202020
> > [... more centroids omitted ...]
> >
> > I have tried using arff.vector to convert arff to vectors, but I don't know
> > how to convert them to the tf-idf vector format expected by streaming
> > kmeans.
> >
> > Thanks
> > Rajesh
> >
> >
> >
> > On Fri, May 31, 2013 at 7:23 PM, Rajesh Nikam <rajeshnikam@gmail.com
> >wrote:
> >
> > Hi Suneel,
> >
> > Thanks a lot for detailed steps !
> > I will try out the steps.
> >
> > Thanks, Ted for pointing this out!
> >
> > Thanks,
> > Rajesh
> >
> >
> > On Thu, May 30, 2013 at 9:50 PM, Suneel Marthi <suneel_marthi@yahoo.com
> >wrote:
> >
> > To add to Ted's reply, streaming k-means was recently added to Mahout
> > (thanks to Dan and Ted).
> >
> > Here's the reference paper that talks about Streaming k-means:
> >
> > http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
> >
> > You have to be working off of trunk to use this; it's not available as
> > part of any release yet.
> >
> > The steps for using Streaming k-means (I don't think it's been documented
> > yet):
> >
> > 1.  Generate sparse vectors via seq2sparse (you have this already).
> >
> > 2.  mahout  streamingkmeans  -i <path to tfidf-vectors>  -o <output path>
> > --tempDir <temp folder path> -ow
> >  -dm org.apache.mahout.common.distance.CosineDistanceMeasure
> >  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch
> >  -k <No. of clusters> -km <see below for the math>
> >
> > -k = no. of clusters
> > -km = k * log(n), where k = no. of clusters and n = no. of datapoints to
> > cluster; round this to the nearest integer
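> >
> > As a quick sketch of that computation (the log base is not pinned down
> > above; the natural log and the sample values of k and n below are just
> > assumptions to make it concrete):
> >
> > public class KmEstimate {
> >   public static void main(String[] args) {
> >     int k = 40;      // -k: the number of clusters you want
> >     int n = 20000;   // n: the number of datapoints to cluster
> >     // -km = k * log(n), rounded to the nearest integer
> >     long km = Math.round(k * Math.log(n));
> >     System.out.println(km);  // pass this value as -km
> >   }
> > }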
> >
> > You have the option of using FastProjectionSearch, ProjectionSearch, or
> > LocalitySensitiveHashSearch for the -sc parameter.
> >
> > ________________________________
> >  From: Ted Dunning <ted.dunning@gmail.com>
> > To: "user@mahout.apache.org" <user@mahout.apache.org>
> > Cc: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <
> > suneel_marthi@yahoo.com>
> > Sent: Thursday, May 30, 2013 12:03 PM
> > Subject: Re: bottom up clustering
> >
> >
> > Rajesh
> >
> > The streaming k-means implementation is very much like what you are
> > asking for. The first pass clusters the data into many, many clusters,
> > and then those clusters are themselves clustered.
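> >
> > A toy, self-contained illustration of those two passes (just the idea,
> > not Mahout's actual implementation; the real code also tracks cluster
> > weights and adapts the distance cutoff as it streams):
> >
> > import java.util.ArrayList;
> > import java.util.List;
> >
> > public class TwoPassToy {
> >
> >   static double dist(double[] a, double[] b) {
> >     double s = 0;
> >     for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
> >     return Math.sqrt(s);
> >   }
> >
> >   // Pass 1: stream over the points once; a point far from every sketch
> >   // centroid opens a new one, otherwise it is folded into the nearest.
> >   static List<double[]> streamingPass(List<double[]> points, double cutoff) {
> >     List<double[]> sketch = new ArrayList<double[]>();
> >     for (double[] p : points) {
> >       double[] nearest = null;
> >       double best = Double.MAX_VALUE;
> >       for (double[] c : sketch) {
> >         double d = dist(p, c);
> >         if (d < best) { best = d; nearest = c; }
> >       }
> >       if (nearest == null || best > cutoff) sketch.add(p.clone());
> >       else for (int i = 0; i < p.length; i++) nearest[i] = 0.5 * (nearest[i] + p[i]);
> >     }
> >     return sketch;
> >   }
> >
> >   // Pass 2: plain Lloyd k-means on the small sketch. The sketch must hold
> >   // at least k points, which is exactly the "Must have more datapoints
> >   // than clusters" check seen earlier in this thread.
> >   static List<double[]> lloyd(List<double[]> sketch, int k, int iterations) {
> >     int dim = sketch.get(0).length;
> >     List<double[]> centers = new ArrayList<double[]>();
> >     for (int j = 0; j < k; j++) centers.add(sketch.get(j).clone());
> >     for (int it = 0; it < iterations; it++) {
> >       double[][] sums = new double[k][dim];
> >       int[] counts = new int[k];
> >       for (double[] p : sketch) {
> >         int bestJ = 0;
> >         for (int j = 1; j < k; j++)
> >           if (dist(p, centers.get(j)) < dist(p, centers.get(bestJ))) bestJ = j;
> >         counts[bestJ]++;
> >         for (int i = 0; i < dim; i++) sums[bestJ][i] += p[i];
> >       }
> >       for (int j = 0; j < k; j++)
> >         if (counts[j] > 0)
> >           for (int i = 0; i < dim; i++) centers.get(j)[i] = sums[j][i] / counts[j];
> >     }
> >     return centers;
> >   }
> > }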
> >
> > Sent from my iPhone
> >
> > On May 30, 2013, at 11:20, Rajesh Nikam <rajeshnikam@gmail.com> wrote:
> >
> > > Hello Suneel,
> > >
> > > I got it. The next step after canopy is to feed these centroids to kmeans
> > > and cluster.
> > >
> > > However, what I want is to use the centroids from these clusters and do
> > > clustering on them, so as to find related clusters.
> > >
> > > Thanks
> > > Rajesh
> > >
> > >
> > > On Thu, May 30, 2013 at 8:38 PM, Suneel Marthi <
> suneel_marthi@yahoo.com
> > >wrote:
> > >
> > >> The input to canopy is your vectors from seq2sparse, not cluster
> > >> centroids (as you had it), hence the error message you are seeing.
> > >>
> > >> The output of canopy can be fed into kmeans as input centroids.
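> > >>
> > >> (If you do want to re-cluster the centroids themselves, a rough and
> > >> untested sketch of one way is below: unwrap the ClusterWritable values
> > >> in the canopy output into plain VectorWritable vectors, then run
> > >> canopy/kmeans on the result. The class name is made up.)
> > >>
> > >> import org.apache.hadoop.conf.Configuration;
> > >> import org.apache.hadoop.fs.FileSystem;
> > >> import org.apache.hadoop.fs.Path;
> > >> import org.apache.hadoop.io.IntWritable;
> > >> import org.apache.hadoop.io.SequenceFile;
> > >> import org.apache.hadoop.io.Writable;
> > >> import org.apache.hadoop.util.ReflectionUtils;
> > >> import org.apache.mahout.clustering.iterator.ClusterWritable;
> > >> import org.apache.mahout.math.VectorWritable;
> > >>
> > >> public class CentroidsToVectors {
> > >>   public static void main(String[] args) throws Exception {
> > >>     Configuration conf = new Configuration();
> > >>     FileSystem fs = FileSystem.get(conf);
> > >>     Path in = new Path(args[0]);   // e.g. .../clusters-0-final/part-r-00000
> > >>     Path out = new Path(args[1]);  // input for the next clustering level
> > >>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
> > >>     SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out,
> > >>         IntWritable.class, VectorWritable.class);
> > >>     Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
> > >>     ClusterWritable value = new ClusterWritable();
> > >>     int i = 0;
> > >>     while (reader.next(key, value)) {
> > >>       // keep only each cluster's center as a datapoint for the next level
> > >>       writer.append(new IntWritable(i++), new VectorWritable(value.getValue().getCenter()));
> > >>     }
> > >>     reader.close();
> > >>     writer.close();
> > >>   }
> > >> }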
> > >>
> > >>
> > >>
> > >>
> > >> ________________________________
> > >> From: Rajesh Nikam <rajeshnikam@gmail.com>
> > >> To: "user@mahout.apache.org" <user@mahout.apache.org>
> > >> Sent: Thursday, May 30, 2013 10:56 AM
> > >> Subject: bottom up clustering
> > >>
> > >>
> > >> Hi,
> > >>
> > >> I want to do bottom-up clustering (that is, hierarchical clustering)
> > >> rather than top-down as mentioned in
> > >>
> > >> https://cwiki.apache.org/MAHOUT/top-down-clustering.html
> > >> (kmeans -> clusterdump -> clusterpp, and then kmeans on each cluster)
> > >>
> > >> How do I take the centroids from the first phase of canopy and use them
> > >> for the next level, of course with the correct t1 and t2?
> > >>
> > >> I have tried using 'canopy', which gives centroids as output. How do I
> > >> apply one more level of clustering on these centroids?
> > >>
> > >> /user/hadoop/t/canopy-centroids/clusters-0-final is the output of the
> > >> first level of canopy.
> > >>
> > >> mahout canopy -i /user/hadoop/t/canopy-centroids/clusters-0-final -o
> > >> /user/hadoop/t/hclust -dm
> > >> org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.01
> > >> -t2 0.02 -ow
> > >>
> > >> It gave the following error:
> > >>
> > >> 13/05/30 20:21:38 INFO mapred.JobClient: Task Id :
> > >> attempt_201305231030_0519_m_000000_0, Status : FAILED
> > >> java.lang.ClassCastException:
> > >> org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast to
> > >> org.apache.mahout.math.VectorWritable
> > >>
> > >> Thanks
> > >> Rajesh
> > >>
> >
