mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vasil Vasilev <>
Subject Re: Meanshift clustering
Date Fri, 21 Jan 2011 08:39:52 GMT
Hi Jeff,

Ok, I can do so - I can write about my experience on the wiki, or may be
better - write it first on a document and then send it to you for a review.

About the vectorization of the data: In the current example each value of
the control signal is assigned to a separate dimension. Having in mind that
this data is sampled from real-world measurements this should mean that the
oscilations should be Gaussian distributed around the ideal line (for
example ideal line could be straight line, or an oscilating line - depends
on the type of signal). That's why I suppose Gaussian cluster with Dirichlet
algorithm worked well in determining the number of clusters. Unfortunately
the data in these clusters were very intermixed, i.e. I wouldn't say it
could cluster the data correctly.

That's why I decided to create another type of vectorization that will
create a good separation of the data and will work well also with the
distance measure based algorithms (like kmeans for example). For each sample
line a produced a vector with 3 dimensions:

dimension 1: Using linear regression with gradient descent algorithm I find
what is the trend of the line, i.e. is it increasing, decreasing or straight
dimension 2: Knowing the approximating line (from the linear regression) I
count how many times this line gets crossed by the original signal. This
helps in separating the cyclic data from all the rest
dimension 3: What is the biggest increase/decrease of a single signal line.
This helps find shifts

So to say - I put a semantics for the data that are to be clustered (I don't
know if it is correct to do that, but I couldn't think of how an algorithm
could cope with the task without such additional semantics)

Also I developed a small swing application which visualizes the clustered
signals and which helped me in playing with the algorithms.

About the implementation of kernel selection in meanshift: I was wondering
what is the best way to test the results. Because with the signals
clustering the algorithm copes almost perfectly. I also didn't find much in
the papers that were cited on the wiki and in the jira about what effects
kernel has in different cases. Have you seen any papers about this? May be
we could try also the algorithm on image data and see the difference

Regards, Vasil

On Thu, Jan 20, 2011 at 8:28 PM, Jeff Eastman <>wrote:

> Hi Vasil,
> Thanks for the great testimonial. Why don't you add your tuning suggestions
> to the wiki? How did your vectorization differ from the example code? If
> your synthetic control results are as good as you say they are you ought to
> post those results too. The current implementation has some scalability
> limitations - as does canopy - but you are not the first to write about its
> positive results. I'd be interested in collaborating on implementing kernel
> selection and it would be nice to be able to compare and contrast the
> algorithm with the original citation on the other points as well. This might
> be another page on the wiki as this implementation is novel afaict.
> I have a pretty big dataset (18 gb, compressed) and will try out mean shift
> on it when I get a chance.
> On 1/20/11 1:27 AM, Vasil Vasilev wrote:
>> Hi Jeff,
>> I used the meanshift algorithm to cluster the synthetic control data (I
>> produced slightly different type of vectorization than what was given in
>> the
>> examples) and it gave the best results compared to all other algorithms.
>> So
>> I am really amazed by what it can do!
>> About (1): One interesting thing that I noted is the influence of T2. It
>> turned out that if you do not select T2 correctly the algorithm may not
>> converge. In my case I had the following situation. At the end (6-th
>> iteration) there were 2 canopies left which were within each-other's T1
>> boundaries, but outside each-other's T2 boundaries. This led to the fact
>> that they were touching each-other on every iteration and switching their
>> centroids. As the distance between the centroids was larger then the
>> convergence delta the algorithm never converged. However when I increased
>> T2
>> to merge these 2 clusters the algorithm converged successfully.
>> In this respect for me the following procedure worked:
>> 1. Select T2 as small as possible (greater then convergence delta)
>> 2. Increase it until the algorithm converges
>> With this approach the algorithm determined also very nicely the correct
>> number of clusters
>> About (2): OK. I will try it and let you know about the result.
>> On Wed, Jan 19, 2011 at 5:50 PM, Jeff Eastman<
>> >wrote:
>>  On 1/18/11 9:10 AM, Vasil Vasilev wrote:
>>>  Hello Mahout developers,
>>>> Currently I am trying to get more in depth with the clustering
>>>> algorithms
>>>> -
>>>> how they should be used and tuned.
>>>> For this purpose I decided to learn from the source code of the
>>>> different
>>>> implementations.
>>>> In this respect I have the following questions about the Meanshift
>>>> algorithm
>>>> (sorry if it may sound naive, but I am a novice in the area):
>>>> 1. I noted that the way it is implemented is different from the
>>>> straightforward approach that is described in the paper (
>>>> ).
>>>> Later I learned from Jira MAHOUT-15 that this was made to enable
>>>> parallelism. There I also noticed that T2 should be fixed to 1.0.
>>>> In fact for me it seems that T2 should be correlated with the
>>>> convergence
>>>> delta parameter (which by default is 0.5) and should be slightly higher
>>>> then
>>>> it. Is my assumption correct?
>>>>  I think the optimal values for T1 and T2 depend upon the distance
>>> measure
>>> chosen and the nature of the data itself. As this implementation is
>>> really
>>> just an iterative application of Canopy, I left both T parameters
>>> specifiable too. This is not exactly the same algorithm as Mean Shift in
>>> the
>>> paper but it seems to do amazingly well in some cases.
>>>  2. With the current implementation the user has the option to select
>>>> desired
>>>> distance measure, but does not have the flexibility to select a kernel.
>>>> The
>>>> current approach results in a hard-coded conical kernel with radius T1
>>>> and
>>>> no points outside T1 are considered in the path calculation of the
>>>> canopy.
>>>> Is it possible to slightly modify the algorithm (similar to the
>>>> modification
>>>> from kmeans to fuzzy kmeans) where weights are associated with a given
>>>> point
>>>> that would touch the canopy and these weights are drown from the kernel
>>>> function. For example they could be drawn from a normal distribution? Do
>>>> you
>>>> think the possibility for kernel selection could impact positively the
>>>> clustering with meanshift in some cases?
>>>>  I don't know but it is intriguing. Why don't you try it?
>>>  Regards, Vasil

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message