Hello Thomas,
Thank you for the answer. I hope to be able to clarify my schedule for the summer in about a week, and I will then decide whether to apply to GSoC this year. I will let you know as soon as I can. Until then, I will briefly describe my first ideas below:
1. Spectral clustering [1] - It basically maps the data into a lower-dimensional space (relying on the eigenvectors of the similarity matrix) and performs (k-means) clustering there. This method can handle a wide variety of problems, regardless of the shape of the clusters. It could be implemented efficiently using the Commons Math linear algebra module.
2. Mean shift algorithm [2] - I haven't grasped all the details of the algorithm yet, but I find it very interesting. As far as I understand, it has been used primarily in pattern recognition and computer vision. I discovered it while searching for an algorithm that does not require the number of clusters as an input parameter. From this point of view, I think it would be a good addition to Commons Math alongside DBSCAN.
3. Clustering evaluation methods
3.1. The Silhouette Coefficient [3] - accounts for the intra-cluster and inter-cluster distances to assign a score in [-1, 1] to a clustering.
3.2. External clustering evaluation [4] - when a gold standard is available for the clustered data, it can be used to assess the performance of a clustering algorithm.
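To make idea 3.1 more concrete, here is a minimal sketch of the silhouette computation in plain Java. It is only an illustration and not Commons Math API: the class name SilhouetteSketch and the method meanSilhouette are hypothetical, the distance is assumed to be Euclidean, and points in singleton clusters are given a score of 0 by convention. For each point, a is the mean distance to the other points of its own cluster, b is the smallest mean distance to the points of any other cluster, and the score is (b - a) / max(a, b).

```java
// Illustrative sketch only; names are hypothetical, not part of Commons Math.
class SilhouetteSketch {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = p[i] - q[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Mean silhouette score over all points.
     * labels[i] is the cluster index of points[i], in [0, k).
     * Assumes at least two non-empty clusters; the score is undefined otherwise.
     */
    static double meanSilhouette(double[][] points, int[] labels, int k) {
        int n = points.length;
        int[] sizes = new int[k];
        for (int label : labels) {
            sizes[label]++;
        }

        double total = 0.0;
        for (int i = 0; i < n; i++) {
            int own = labels[i];
            if (sizes[own] == 1) {
                continue; // singleton cluster: score 0 by convention
            }
            // Sum of distances from point i to the points of each cluster.
            double[] sumDist = new double[k];
            for (int j = 0; j < n; j++) {
                if (j != i) {
                    sumDist[labels[j]] += distance(points[i], points[j]);
                }
            }
            // a: mean intra-cluster distance (excluding the point itself).
            double a = sumDist[own] / (sizes[own] - 1);
            // b: smallest mean distance to any other non-empty cluster.
            double b = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                if (c != own && sizes[c] > 0) {
                    b = Math.min(b, sumDist[c] / sizes[c]);
                }
            }
            double max = Math.max(a, b);
            if (max > 0.0) {
                total += (b - a) / max; // contributes 0 when a == b == 0
            }
        }
        return total / n;
    }

    public static void main(String[] args) {
        double[][] points = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        int[] compact = {0, 0, 1, 1}; // groups nearby points together
        int[] mixed = {0, 1, 0, 1};   // pairs each point with a distant one
        System.out.println("compact: " + meanSilhouette(points, compact, 2));
        System.out.println("mixed:   " + meanSilhouette(points, mixed, 2));
    }
}
```

On this toy data the compact labelling should score close to 1 and the mixed labelling below 0, which is the behaviour one would expect from the measure. The sketch is quadratic in the number of points; a library version would probably need to accept an arbitrary distance measure as well.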
Suggestions are more than welcome. If you have requests from users for specific clustering algorithms, please let me know.
Best regards,
Alina
[1] http://www.informatik.uni-hamburg.de/ML/contents/people/luxburg/publications/Luxburg07_tutorial.pdf
[2] http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1055330&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1055330
[3] http://www.sciencedirect.com/science/article/pii/0377042787901257
[4] http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
From: Thomas Neidhart
To: Commons Developers List
Sent: Sunday, February 1, 2015 8:33 PM
Subject: Re: [Math] Contributions to the clustering module (maybe GSoC)
On 02/01/2015 02:06 PM, Alina Ciobanu wrote:
> Hello everyone,
> My name is Alina Ciobanu. I'm a first-year Ph.D. student in computer science (NLP) at the Faculty of Mathematics and Computer Science, University of Bucharest, Romania. I am interested in contributing to the Apache Commons Math library. My idea is to work on the clustering module, to implement spectral clustering, maybe also the mean shift algorithm, and some clustering validation methods. Would you please tell me if you think that such a contribution would be useful to the Commons Math users? If so, I will provide more details about what I have in mind. Any suggestions are welcome.
> I am also thinking about applying to Google Summer of Code this year. I haven't decided yet because I am not sure, at this moment, if my schedule for this summer would allow it. Thus, this question is only in perspective: would anyone from the Commons Math community be interested in mentoring a GSoC project (on the clustering module, as described above, or on something related)?
> Best regards,
> Alina Ciobanu
Hi Alina,
Good to hear about your interest in commons-math. New contributions are
very welcome, and we do indeed have several feature requests to add new
clustering algorithms.
I am certainly interested in mentoring you for GSoC, but there may also
be others here who can help with that.
Just let us know what you want to do early on so that we can prepare
ourselves.
Thomas