> Well, it does matter to some degree, since picking random vectors tends to give you dense vectors, whereas text gives you very sparse vectors. Different patterns of sparsity can cause radically different time complexity for the clustering.

I have yet to find a random combination of vectors that substantially benefits the performance of kMeans. I have also tried real datasets (like the one I was initially using, built from large amounts of data describing consumers' buying habits), to no avail. How should a collection of vectors be created so that, say, it does not significantly compromise the algorithm's functionality?
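For concreteness, here is a minimal sketch (assuming NumPy, SciPy, and scikit-learn, which are not named in the thread) of the two kinds of input being contrasted: dense random vectors versus sparse random vectors of the sort text data produces. The shapes, density, and cluster count are illustrative only.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_samples, n_features = 10_000, 1_000

# Dense random vectors: essentially every component is non-zero.
X_dense = rng.random((n_samples, n_features))

# Sparse random vectors: ~1% of components non-zero, similar to bag-of-words text data.
X_sparse = sparse.random(n_samples, n_features, density=0.01,
                         format="csr", random_state=0)

# Both can be fed to k-means; the sparsity pattern, not just the values,
# is what drives the difference in observed running time.
KMeans(n_clusters=20, n_init=1, random_state=0).fit(X_dense)
KMeans(n_clusters=20, n_init=1, random_state=0).fit(X_sparse)
```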