It's been over a year since I ran any tests of KMeans on larger data sets, and there has been a lot of refactoring done in the interim. I was also using only dense vectors. It is entirely possible it is now doing something really poorly. I'm surprised that it is taking such a long time to munch such a small dataset, but it sounds like you can reproduce it on a single machine, so profiling should suggest the root cause. I'm going to be away from the computer for the next two weeks - a real vacation - so unfortunately I won't be able to contribute to this effort.

Jeff

Grant Ingersoll wrote:
>
> On Jul 27, 2009, at 12:00 AM, nfantone wrote:
>
>> Thanks, Grant. I just updated and noticed the change.
>>
>> As a side note: do you think someone could run some real tests on kMeans in particular, other than the ones already in the project? I bet there are other naive (or not so naive) problems like that. After much coding, reading and experimenting in the last weeks with clustering in Mahout, I am inclined to say something may not be fully working with kMeans as of now. Or perhaps it just needs some refactoring/performance tweaks. Jeff has claimed to run the job over gigabytes of data, using a rather small cluster, in minutes. Has anyone tried to accomplish this recently (since the Hadoop upgrade to 0.20)? Just use ClusteringUtils to write a file of some (arguably not so) significant number of random Vectors (say, 800,000+) and let that be the input of a KMeansMRJob (testKMeansMRJob() could very well serve this purpose with little change). You'll end up with a file of about ~85MB to ~100MB, which can easily fit into memory on any modern computer. Now, run the whole thing (I've tried both locally and using a three-node cluster setup - which, frankly, seemed like a bit too much computing power for such a small number of items in the dataset). It'll take forever to complete.
>>
>
> I hope to hit this soon. I've got some Amazon credits I need to use and hope to put them towards this.
>
> As with any project in open source, we need people to kick the tires, give feedback (thank you!) and also poke around the code to make it better.
>
> Have you tried your data with some other clustering code, perhaps Weka or something like that?
>
>
>> These simple methods could be used to generate any given number of random SparseVectors for testing's sake, if anyone is interested:
>>
>> private static Random rnd = new Random();
>> private static final int CARDINALITY = 1200;
>> private static final int MAX_NON_ZEROS = 200;
>> private static final int MAX_VECTORS = 850000;
>>
>> private static Vector getRandomVector() {
>>   Integer id = rnd.nextInt(Integer.MAX_VALUE);
>>   Vector v = new SparseVector(id.toString(), CARDINALITY);
>>   // Pick a non-zero count in [1, MAX_NON_ZEROS).
>>   int nonZeros = 0;
>>   while ((nonZeros = rnd.nextInt(MAX_NON_ZEROS)) == 0);
>>   for (int i = 0; i < nonZeros; i++) {
>>     v.setQuick(rnd.nextInt(CARDINALITY), rnd.nextDouble());
>>   }
>>   return v;
>> }
>>
>> private static List<Vector> getVectors() {
>>   List<Vector> vectors = new ArrayList<Vector>(MAX_VECTORS);
>>   for (int i = 0; i < MAX_VECTORS; i++) {
>>     vectors.add(getRandomVector());
>>   }
>>   return vectors;
>> }
>>
>
>
> I'm not sure why testing with random vectors would be all that useful other than that it shows it runs. I wouldn't expect anything useful to come out of it, though.
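[For anyone who wants to reproduce the setup described above, here is a minimal sketch of writing the generated vectors out as a SequenceFile that KMeansDriver can consume. It writes the file directly rather than going through ClusteringUtils, and it assumes the 0.2-era API: the org.apache.mahout.matrix package names and the assumption that SparseVector can be appended as a Writable value (as the Cluster is in the snippet further down the thread) may need adjusting. The class name and output path are made up for the example.]

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.matrix.SparseVector;
import org.apache.mahout.matrix.Vector;

public class RandomVectorWriter {

  // Writes the generated vectors as <Text, SparseVector> pairs, the same
  // logical format as the dataset discussed later in this thread.
  public static void writeVectors(List<Vector> vectors, String output) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, new Path(output),
        Text.class, SparseVector.class);
    try {
      int i = 0;
      for (Vector v : vectors) {
        // Key is just a running id; the clustering only looks at the vector values.
        // Assumes SparseVector is itself a Writable in this (0.2-era) API.
        writer.append(new Text(String.valueOf(i++)), (SparseVector) v);
      }
    } finally {
      writer.close();
    }
  }
}

[The resulting file can then be copied to HDFS and passed to KMeansDriver exactly as shown further down in the thread.]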
>
>
>> On Sun, Jul 26, 2009 at 10:30 PM, Grant Ingersoll wrote:
>>> Fixed on MAHOUT-152
>>>
>>> On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:
>>>
>>>> That does indeed look like a problem. I'll fix.
>>>>
>>>> On Jul 26, 2009, at 2:37 PM, nfantone wrote:
>>>>
>>>>> While (still) experiencing performance issues and inspecting kMeans code, I found this lying around SquaredEuclideanDistanceMeasure.java:
>>>>>
>>>>> public double distance(double centroidLengthSquare, Vector centroid, Vector v) {
>>>>>   if (centroid.size() != centroid.size()) {
>>>>>     throw new CardinalityException();
>>>>>   }
>>>>>   ...
>>>>> }
>>>>>
>>>>> I bet someone meant to compare centroid and v sizes and didn't notice.
>>>>>
>>>>> On Fri, Jul 24, 2009 at 12:38 PM, nfantone wrote:
>>>>>>
>>>>>> Well, as it turned out, it didn't have anything to do with my performance issue, but I found out that writing a Cluster (with a single vector as its center) to a file and then reading it back requires the center to be added as a point; otherwise, you won't be able to retrieve it as expected. Therefore, one should do:
>>>>>>
>>>>>> // Writing
>>>>>> String id = "someID";
>>>>>> Vector v = new SparseVector();
>>>>>> Cluster c = new Cluster(v);
>>>>>> c.addPoint(v);
>>>>>> seqWriter.append(new Text(id), c);
>>>>>>
>>>>>> // Reading
>>>>>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>>>>>> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
>>>>>> while (seqReader.next(key, value)) {
>>>>>>   ...
>>>>>>   Vector centroid = value.getCenter();
>>>>>>   ...
>>>>>> }
>>>>>>
>>>>>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think this shouldn't happen. Then again, it's not that relevant, I guess.
>>>>>>
>>>>>> Sorry for bringing different subjects to the same thread.
>>>>>>
>>>>>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone wrote:
>>>>>>>
>>>>>>> I've been using RandomSeedGenerator to generate initial clusters for kMeans, and while checking its code I stumbled upon this:
>>>>>>>
>>>>>>> while (reader.next(key, value)) {
>>>>>>>   Cluster newCluster = new Cluster(value);
>>>>>>>   newCluster.addPoint(value);
>>>>>>>   ...
>>>>>>> }
>>>>>>>
>>>>>>> I can see it adds the vector to the newly created cluster, even though it is setting it as its center in the constructor. Wasn't this corrected in a past revision? I thought this was not necessary anymore. I'll look into it a little bit more and see if this has something to do with my lack of performance with my dataset.
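[For reference, the cardinality check nfantone flags above presumably becomes a comparison between the two different vectors once the MAHOUT-152 fix is applied. A sketch of just the corrected condition; the rest of the method is left elided here, as it is in the quoted snippet:]

public double distance(double centroidLengthSquare, Vector centroid, Vector v) {
  // Compare the input vector against the centroid, not the centroid against itself.
  if (centroid.size() != v.size()) {
    throw new CardinalityException();
  }
  // ... rest of the method unchanged ...
}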
>>>>>>>
>>>>>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone wrote:
>>>>>>>>>>>
>>>>>>>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>>>>>>>
>>>>>>>>>> I'll try that.
>>>>>>>>
>>>>>>>> There was no significant change while modifying the convergence value. At least, none was observed during the first three iterations, which lasted about the same amount of time as before.
>>>>>>>>
>>>>>>>>>>> Is there any chance your data is publicly shareable? Come to think of it, with the vector representations, as long as you don't publish the key (which terms map to which index), I would think most all data is publicly shareable.
>>>>>>>>>>
>>>>>>>>>> I'm sorry, I don't quite understand what you're asking. Publicly shareable? As in user-permissions to access/read/write the data?
>>>>>>>>>
>>>>>>>>> As in post a copy of the SequenceFile somewhere for download, assuming you can. Then others could presumably try it out.
>>>>>>>>
>>>>>>>> My bad. Of course it is:
>>>>>>>>
>>>>>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>>>>>
>>>>>>>> That's the ~62MB SequenceFile sample I've been using, in <Text, SparseVector> logical format.
>>>>>>>>
>>>>>>>>> That does seem like an awfully long time for 62 MB on a 6 node cluster. How many iterations are running?
>>>>>>>>
>>>>>>>> I'm running the whole thing with a 20-iteration cap. Every iteration - EXCEPT the first one which, oddly, lasted just two minutes - took around 3 hours to complete:
>>>>>>>>
>>>>>>>> Hadoop job_200907221734_0001
>>>>>>>> Finished in: 1mins, 42sec
>>>>>>>>
>>>>>>>> Hadoop job_200907221734_0004
>>>>>>>> Finished in: 2hrs, 34mins, 3sec
>>>>>>>>
>>>>>>>> Hadoop job_200907221734_0005
>>>>>>>> Finished in: 2hrs, 59mins, 34sec
>>>>>>>>
>>>>>>>>> How did you generate your initial clusters?
>>>>>>>>
>>>>>>>> I generate the initial clusters via the RandomSeedGenerator, setting a 'k' value of 200. This is what I did to initiate the process for the first time:
>>>>>>>>
>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
>>>>>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c init -o output -r 32 -d 0.01 -k 200
>>>>>>>>
>>>>>>>>> Where are the iteration jobs spending most of their time (map vs. reduce)?
>>>>>>>>
>>>>>>>> I'm tempted to say map here, but the time they spend is rather comparable, actually. Reduce attempts are taking an hour and a half to finish (on average), and so are Map attempts. Here are some representative examples from the web UI:
>>>>>>>>
>>>>>>>> reduce
>>>>>>>> attempt_200907221734_0002_r_000006_0
>>>>>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>>>>>
>>>>>>>> map
>>>>>>>> attempt_200907221734_0002_m_000000_0
>>>>>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>>>>>
>>>>>>>> Perhaps there's something wrong with the way I create the SequenceFile? I could share the Java code as well, if required.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
>>>> http://www.lucidimagination.com/search
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
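[Since the initial clusters come from RandomSeedGenerator with -k 200, one cheap sanity check while profiling is to count what actually landed in the init directory before the iterations start. This is a minimal sketch that reuses the read loop nfantone posted earlier in the thread; the default path below is hypothetical and should point at whatever file the generator actually wrote.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.clustering.kmeans.Cluster;

public class SeedClusterCount {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical default path: point this at the seed file under 'init'.
    Path seeds = new Path(args.length > 0 ? args[0] : "init/part-randomSeed");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, seeds, conf);
    try {
      Writable key = (Writable) reader.getKeyClass().newInstance();
      Cluster value = (Cluster) reader.getValueClass().newInstance();
      int count = 0;
      while (reader.next(key, value)) {
        count++;
      }
      // With -k 200 this should print 200.
      System.out.println("Seed clusters found: " + count);
    } finally {
      reader.close();
    }
  }
}

[If the count comes back as 200 and each iteration still takes hours, the seeding step can be ruled out and the profiling focus stays on the map and reduce phases discussed above.]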