mahout-user mailing list archives

From nfantone <nfant...@gmail.com>
Subject Re: Clustering from DB
Date Mon, 27 Jul 2009 18:05:45 GMT
> I'm not sure why testing with Random vectors would be all that useful other than it shows it
> runs.  I wouldn't expect anything useful to come out of it, though.

Well... my point was that it doesn't really matter how you create the
Vectors: it's the size of the final file(s) that's relevant. Then
again, that IS the problem behind it all: it runs, and that's about all
it does for now.

> How did you create your SeqFile?  From what I can tell from Ted, it is important to get the
> norms and distance measures lined up.

I created the file by using the random-vector-generator methods above
and the ClusteringUtils class in the project. Is it mandatory for the
vectors to be normalized? If so, I can tell you mine aren't. Should
normalize() be called before appending a vector to the output?
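
To make the question concrete, here is a minimal sketch of what normalizing before appending would amount to. It uses plain double[] arrays, not Mahout's Vector class, and the class and method names (NormalizeSketch, normalizeL2) are made up for illustration. It assumes the Euclidean (L2) norm is the one that matters, which depends on the distance measure in use.

```java
public class NormalizeSketch {

    /**
     * Returns a copy of v scaled to unit Euclidean (L2) length.
     * A zero vector has no direction, so it is returned unchanged.
     */
    public static double[] normalizeL2(double[] v) {
        double sumSquares = 0.0;
        for (double x : v) {
            sumSquares += x * x;
        }
        double norm = Math.sqrt(sumSquares);
        double[] out = v.clone();
        if (norm == 0.0) {
            return out;
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input; in the real job this would be each generated vector
        // just before it is appended to the output file.
        double[] unit = normalizeL2(new double[] {3.0, 4.0});
        System.out.println(unit[0] + " " + unit[1]);
    }
}
```

Whether this step is needed at all is exactly what I'm asking: it changes the results for a Euclidean measure, while a cosine-style measure should not care about vector length.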

> Hmm, some profiling shows the pain is in the distance calculation for emitPointToNearestCluster.

I may be wrong, but I think that's the only method being called during
a map phase (once per vector in the file/s). From a quick glance at
it, may I suggest these simple changes?

    Cluster nearestCluster = null;
    double nearestDistance = Double.MAX_VALUE;
    double distance = 0;
    for (Cluster cluster : clusters) {
      distance = measure.distance(point, cluster.getCenter());
      if (distance < nearestDistance) {
        nearestCluster = cluster;
        nearestDistance = distance;
      }
    }

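Move the distance variable outside the loop, initialize it to 0,
and eliminate the null comparison. Since nearestDistance starts at
Double.MAX_VALUE, any smaller distance wins the first comparison, so
the null check is redundant. That is one less check to perform
on each iteration.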
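
For anyone who wants to try the loop in isolation, here is a self-contained sketch of the same idea. It is not the Mahout code: it works on plain double[] arrays, returns an index instead of a Cluster, and hard-codes a Euclidean distance in place of the configurable measure. The names NearestClusterSketch, euclidean and nearestIndex are hypothetical.

```java
public class NearestClusterSketch {

    /** Plain Euclidean distance between two vectors of equal length. */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the index of the center closest to point, or -1 if no
     * center has a distance below Double.MAX_VALUE (e.g. centers is empty).
     */
    public static int nearestIndex(double[] point, double[][] centers) {
        int nearest = -1;
        // Sentinel: any smaller distance beats it, so no null/-1 check is needed in the loop.
        double nearestDistance = Double.MAX_VALUE;
        double distance = 0;
        for (int i = 0; i < centers.length; i++) {
            distance = euclidean(point, centers[i]);
            if (distance < nearestDistance) {
                nearest = i;
                nearestDistance = distance;
            }
        }
        return nearest;
    }

    public static void main(String[] args) {
        // Made-up centers and point, for illustration only.
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}, {5.0, 5.0}};
        double[] point = {6.0, 6.0};
        System.out.println("nearest center index: " + nearestIndex(point, centers));
    }
}
```

One caveat with the sentinel approach: if every distance comes back as NaN or is not below Double.MAX_VALUE, nothing is selected, so the caller still has to handle the "no cluster" case once, after the loop.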

On Mon, Jul 27, 2009 at 1:55 PM, Shashikant Kore<shashikant@gmail.com> wrote:
> On Mon, Jul 27, 2009 at 10:11 PM, Grant Ingersoll<gsingers@apache.org> wrote:
>>
>> Not following.  The distance calc stuff is irrespective of the type of
>> Vector.  I was referring to the centroid length square (I think you called
>> it the triangle inequality) stuff that Shashikant added on MAHOUT-121.  We
>> use it for testing convergence, but not for other distance calculations.  I
>> haven't looked to see if it is applicable yet, but it seems like it should
>> be.
>>
>
> Grant,
>
> Yes, that part of the patch is missing.  In my original patch, I had
> modified the  emitPointToNearestCluster() in kmeans/Cluster.java to
> calculate distance between document and centroids of various clusters.
>  (There is no triangle inequality code, though.)  In the later patches
> I don't see that code.
>
> I had reviewed the final patch, but I missed out on this one.  I
> think, I only ran Canopy and not K-means. Incidentally, I am
> hopelessly out of date with trunk as recently I have not worked on
> this.  BTW, I haven't really followed this thread in depth. So, I
> might be speaking out of context here. Apologies.
>
> --shashi
>
