mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Meanshift clustering
Date Tue, 25 Jan 2011 05:28:30 GMT
Sure, please do. Please include hints about the various cases where it
works (or does not).

Thanks!

On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <vavasilev@gmail.com> wrote:
> Hello,
>
> In fact my estimation technique works only for the mean-shift algorithm,
> because there the centers of the canopies move around. With pure
> Canopy clustering I don't think it will work.
>
> I spent some time implementing the idea of different kernel profiles
> in the mean-shift clustering algorithm, and here are the results:
> 1. I changed the original algorithm slightly. Previously, when one
> cluster "touched" other clusters, its centroid was computed only from the
> centroids of the clusters it touched, not from its own centroid.
> Now, before calculating the shift, I added one more line which makes the
> cluster observe its own centroid:
>
> public boolean shiftToMean(MeanShiftCanopy canopy) {
>   canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
>   canopy.computeConvergence(measure, convergenceDelta);
>   canopy.computeParameters();
>   return canopy.isConverged();
> }
>
> With this modification the convergence problem of the algorithm
> (that I described earlier) also disappeared, although the number of
> iterations until convergence increased slightly.
> This change was necessary in order to introduce the other types of "soft"
> kernels.
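A minimal, self-contained sketch of why observing the canopy's own centroid changes the shift. The numbers and class are hypothetical (not Mahout's actual classes); it only illustrates the weighted-mean arithmetic the modified `shiftToMean` produces:

```java
// Sketch (hypothetical numbers) of the effect of self-observation: the shifted
// mean now includes the canopy's current center, weighted by its bound-point
// count, instead of only the touched neighbors' centers.
public class SelfObserveSketch {
    public static void main(String[] args) {
        double ownCenter = 0.0;  int ownBound = 4;      // this canopy
        double neighbor = 10.0;  int neighborBound = 1; // one touched neighbor

        // Without self-observation: mean of the neighbor centers only.
        double without = neighbor;
        // With self-observation: weighted by bound-point counts.
        double with = (ownCenter * ownBound + neighbor * neighborBound)
                    / (ownBound + neighborBound);

        System.out.println(without); // 10.0
        System.out.println(with);    // 2.0
    }
}
```

The self-weighted mean moves much less per step, which matches the observation that convergence stabilizes while the iteration count grows slightly.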
>
> 2. I introduced an IKernelProfile interface with the methods:
>
> public interface IKernelProfile {
>   double calculateValue(double distance, double h);
>   double calculateDerivativeValue(double distance, double h);
> }
>
> 3. I created 2 implementations:
> TriangularKernelProfile, whose derivative value is:
>
> @Override
> public double calculateDerivativeValue(double distance, double h) {
>   return (distance < h) ? 1.0 : 0.0;
> }
>
> and NormalKernelProfile, whose derivative value is:
>
> @Override
> public double calculateDerivativeValue(double distance, double h) {
>   return UncommonDistributions.dNorm(distance, 0.0, h);
> }
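The two profiles above can be tried in isolation with a self-contained sketch. Mahout's `UncommonDistributions.dNorm` is not on the classpath here, so the normal density N(0, h) is reimplemented inline (an assumption about what `dNorm` computes):

```java
// Self-contained sketch of the two kernel profiles described in the message.
public class KernelProfiles {
    interface IKernelProfile {
        double calculateDerivativeValue(double distance, double h);
    }

    // Flat (uniform) weight inside the window: the original "hard" behavior.
    static class TriangularKernelProfile implements IKernelProfile {
        public double calculateDerivativeValue(double distance, double h) {
            return (distance < h) ? 1.0 : 0.0;
        }
    }

    // Gaussian weight: the N(0, h) density evaluated at the distance.
    static class NormalKernelProfile implements IKernelProfile {
        public double calculateDerivativeValue(double distance, double h) {
            return Math.exp(-distance * distance / (2.0 * h * h))
                / (h * Math.sqrt(2.0 * Math.PI));
        }
    }

    public static void main(String[] args) {
        IKernelProfile tri = new TriangularKernelProfile();
        IKernelProfile norm = new NormalKernelProfile();
        System.out.println(tri.calculateDerivativeValue(0.5, 1.0));  // inside window
        System.out.println(tri.calculateDerivativeValue(1.5, 1.0));  // outside window
        // The Gaussian weight decays smoothly instead of cutting off at h.
        System.out.printf("%.4f%n", norm.calculateDerivativeValue(0.0, 1.0));
    }
}
```

The hard kernel gives every canopy within T1 the same influence; the normal kernel down-weights canopies smoothly with distance, which is what produces the "blurred" boundaries noted below.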
>
> 4. I modified the code for merging canopies:
>
> public void mergeCanopy(MeanShiftCanopy aCanopy, Collection<MeanShiftCanopy> canopies) {
>   MeanShiftCanopy closestCoveringCanopy = null;
>   double closestNorm = Double.MAX_VALUE;
>   for (MeanShiftCanopy canopy : canopies) {
>     double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
>     double weight = kernelProfile.calculateDerivativeValue(norm, t1);
>     if (weight > 0.0) {
>       aCanopy.touch(canopy, weight);
>     }
>     if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) {
>       closestNorm = norm;
>       closestCoveringCanopy = canopy;
>     }
>   }
>   if (closestCoveringCanopy == null) {
>     canopies.add(aCanopy);
>   } else {
>     closestCoveringCanopy.merge(aCanopy);
>   }
> }
>
> 5. I modified MeanShiftCanopy:
>
> void touch(MeanShiftCanopy canopy, double weight) {
>   canopy.observe(getCenter(), weight * boundPoints.size());
>   observe(canopy.getCenter(), weight * canopy.boundPoints.size());
> }
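The mutual, kernel-weighted observation in `touch` can be sketched self-contained. The `Canopy` class below is a hypothetical stand-in (a weighted running mean), not Mahout's `MeanShiftCanopy`:

```java
// Minimal sketch of the weighted touch(): each canopy observes the other's
// center with weight = kernelWeight * (other's bound-point count).
import java.util.Arrays;

public class TouchSketch {
    static class Canopy {
        double[] center;
        int boundPoints;        // number of points bound to this canopy
        double[] observedSum;   // weighted sum of observed centers
        double observedWeight;  // total observed weight

        Canopy(double[] center, int boundPoints) {
            this.center = center;
            this.boundPoints = boundPoints;
            this.observedSum = new double[center.length];
        }

        void observe(double[] point, double weight) {
            for (int i = 0; i < point.length; i++) {
                observedSum[i] += weight * point[i];
            }
            observedWeight += weight;
        }

        void touch(Canopy other, double kernelWeight) {
            other.observe(center, kernelWeight * boundPoints);
            observe(other.center, kernelWeight * other.boundPoints);
        }

        double[] mean() {
            double[] m = new double[center.length];
            for (int i = 0; i < m.length; i++) {
                m[i] = observedSum[i] / observedWeight;
            }
            return m;
        }
    }

    public static void main(String[] args) {
        Canopy a = new Canopy(new double[]{0.0}, 2);
        Canopy b = new Canopy(new double[]{4.0}, 6);
        a.touch(b, 0.5);  // soft kernel weight of 0.5
        System.out.println(Arrays.toString(a.mean())); // [4.0]
    }
}
```

With a hard kernel the weight is always 1.0 inside T1; with the normal kernel distant canopies contribute proportionally less to each other's means.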
>
> 6. I modified some other files as necessary for the code to compile and
> for the tests to pass.
>
> With the algorithm changed in this way I made the following observations:
>
> 1. Using the normal kernel "blurs" the cluster boundaries, i.e. the cluster
> contents are more intermixed.
> 2. As there is no convergence problem any more, I found the following
> procedure for estimating T2 and the convergence delta:
>  - bind convergence delta = T2 / 2
>  - when you decrease T2, the number of iterations increases and the number
> of resulting clusters after convergence decreases
>  - you reach a point where decreasing T2 still increases the number of
> iterations, but the number of resulting clusters remains unchanged. This
> is the point with the best value for T2.
> 3. In the case of the normal kernel, what you give as T1 is in fact the
> standard deviation of a normal distribution, so the radius of the window
> will be T1^2.
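The T2 estimation procedure above can be sketched as a loop. `runClustering` here is a hypothetical stand-in returning the cluster count for a given T2 (the real step would run the mean-shift job); the simulated counts are made-up illustration values:

```java
// Hypothetical sketch of the T2 estimation procedure: halve T2 until the
// resulting cluster count stops changing, then bind delta = T2 / 2.
public class EstimateT2 {
    // Stand-in for running mean-shift and counting resulting clusters.
    static int runClustering(double t2) {
        if (t2 > 4.0) return 12;
        if (t2 > 2.0) return 7;
        return 5;  // count stops changing for small T2
    }

    public static void main(String[] args) {
        double t2 = 8.0;
        int prev = runClustering(t2);
        while (true) {
            double next = t2 / 2.0;
            int count = runClustering(next);
            if (count == prev) {
                break;  // count unchanged: current t2 is already good
            }
            prev = count;
            t2 = next;
        }
        double convergenceDelta = t2 / 2.0;  // bind delta = T2 / 2, as above
        System.out.println("T2=" + t2 + " delta=" + convergenceDelta);
    }
}
```

Each probe costs a full clustering run, so in practice one would limit the number of halvings; the sketch only encodes the stopping rule described in the list.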
>
> If you are interested I can send the code.
>
> Regards, Vasil
>
> On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <goksron@gmail.com> wrote:
>
>> Vasil-
>>
>> Would you consider adding your estimation algorithm to this patch?
>> https://issues.apache.org/jira/browse/MAHOUT-563
>>
>> The estimator in there now is stupid; a real one would make the Canopy
>> algorithms orders of magnitude more useful.
>>
>> Lance
>>
>> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <ted.dunning@gmail.com>
>> wrote:
>> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <vavasilev@gmail.com>
>> wrote:
>> >
>> >>
>> >> dimension 1: Using linear regression with a gradient descent algorithm
>> >> I find the trend of the line, i.e. whether it is an increasing,
>> >> decreasing or straight line.
>> >> dimension 2: Knowing the approximating line (from the linear
>> >> regression), I count how many times this line gets crossed by the
>> >> original signal. This helps in separating the cyclic data from all the
>> >> rest.
>> >> dimension 3: What is the biggest increase/decrease of a single signal
>> >> line. This helps find shifts.
>> >>
>> >> So to say, I put semantics on the data that are to be clustered (I
>> >> don't know if it is correct to do that, but I couldn't think of how an
>> >> algorithm could cope with the task without such additional semantics).
>> >>
>> >
>> > It is very common for feature extraction like this to be the key for
>> > data-mining projects. Such features are absolutely critical for most
>> > time-series mining and are highly application dependent.
>> >
>> > One key aspect of your features is that they are shift invariant.
>> >
>> >
>> >> Also I developed a small swing application which visualizes the
>> clustered
>> >> signals and which helped me in playing with the algorithms.
>> >>
>> >
>> > Great idea.
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>



-- 
Lance Norskog
goksron@gmail.com
