Hello Thomas,
Thank you for the answer. I hope to be able to clarify my schedule for the summer in about
a week, and then decide whether I should apply to GSoC this year or not. I will
let you know as soon as I can. Until then, I will briefly describe my first ideas below:
1. Spectral clustering [1]: it maps the data into a lower-dimensional space (relying
on the eigenvectors of a graph Laplacian derived from the similarity matrix) and performs
(k-means) clustering there. This method can handle a wide variety of problems, regardless of
the shape of the clusters. It could be implemented efficiently using the Commons Math linear
algebra module.
2. Mean shift algorithm [2]: I haven't grasped all the details of the algorithm yet, but I find
it very interesting. As far as I understand, it has been primarily used in pattern recognition
and computer vision. I discovered it while searching for an algorithm that does not require
the number of clusters as an input parameter. From this point of view, I think it would be a
good addition to Commons Math alongside DBSCAN.
3. Clustering evaluation methods
3.1. The Silhouette Coefficient [3]: accounts for the intra-cluster and inter-cluster
distances to assign a score in [-1, 1] to a clustering.
3.2. External clustering evaluation [4]: when a gold standard is available for the clustered
data, it can be used to assess the performance of a clustering algorithm.
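
As a rough illustration of item 3.1, here is a minimal sketch of the silhouette computation
in plain Java. The class and method names (SilhouetteSketch, silhouette, distance) are
hypothetical and not part of Commons Math. The sketch assumes Euclidean distance, cluster
labels given as integers in 0..k-1, and a score of 0 for points in singleton clusters.

```java
public class SilhouetteSketch {

    /** Euclidean distance between two points of equal dimension (an assumed metric). */
    public static double distance(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = p[i] - q[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Mean silhouette over all points; labels[i] is the cluster index (0..k-1) of points[i].
     * For each point, a is the mean distance to its own cluster and b is the smallest
     * mean distance to any other cluster; the point's score is (b - a) / max(a, b).
     */
    public static double silhouette(double[][] points, int[] labels, int k) {
        int n = points.length;
        int[] sizes = new int[k];
        for (int label : labels) {
            sizes[label]++;
        }
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            // Sum of distances from point i to the members of each cluster.
            double[] sums = new double[k];
            for (int j = 0; j < n; j++) {
                if (j != i) {
                    sums[labels[j]] += distance(points[i], points[j]);
                }
            }
            int own = labels[i];
            if (sizes[own] < 2) {
                continue; // singleton cluster: contributes 0 by convention
            }
            double a = sums[own] / (sizes[own] - 1);
            double b = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                if (c != own && sizes[c] > 0) {
                    b = Math.min(b, sums[c] / sizes[c]);
                }
            }
            double max = Math.max(a, b);
            if (Double.isInfinite(b) || max == 0.0) {
                continue; // no other cluster, or all distances zero: contributes 0
            }
            total += (b - a) / max;
        }
        return total / n;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        // A clustering that matches the two visible groups scores close to 1.
        System.out.println(silhouette(points, new int[] {0, 0, 1, 1}, 2));
        // A clustering that mixes the groups scores below 0.
        System.out.println(silhouette(points, new int[] {0, 1, 0, 1}, 2));
    }
}
```

This version costs O(n^2) distance evaluations, which is fine for a sketch; an actual
contribution would presumably work with the library's own cluster and distance abstractions.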
Suggestions are more than welcome. If you have requests from users for specific clustering
algorithms, please let me know.
Best regards,
Alina
[1] http://www.informatik.unihamburg.de/ML/contents/people/luxburg/publications/Luxburg07_tutorial.pdf
[2] http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1055330&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1055330
[3] http://www.sciencedirect.com/science/article/pii/0377042787901257
[4] http://nlp.stanford.edu/IRbook/html/htmledition/evaluationofclustering1.html
From: Thomas Neidhart <thomas.neidhart@gmail.com>
To: Commons Developers List <dev@commons.apache.org>
Sent: Sunday, February 1, 2015 8:33 PM
Subject: Re: [Math] Contributions to the clustering module (maybe GSoC)
On 02/01/2015 02:06 PM, Alina Ciobanu wrote:
> Hello everyone,
> My name is Alina Ciobanu. I'm a first-year Ph.D. student in computer science (NLP) at
> the Faculty of Mathematics and Computer Science, University of Bucharest, Romania. I am interested
> in contributing to the Apache Commons Math library. My idea is to work on the clustering module,
> to implement spectral clustering, maybe also the mean shift algorithm, and some clustering
> validation methods. Would you please tell me if you think that such a contribution would be
> useful to the Commons Math users? If so, I will provide more details about what I have in
> mind. Any suggestions are welcome.
> I am also thinking about applying to Google Summer of Code this year. I haven't decided
> yet because I am not sure, at this moment, if my schedule for this summer would allow it.
> Thus, this question is only in perspective: would anyone from the Commons Math community be
> interested in mentoring a GSoC project (on the clustering module, as described above, or on
> something related)?
> Best regards,
> Alina Ciobanu
Hi Alina,
good to hear about your interest in commons-math. New contributions are
very welcome, and we do indeed have several feature requests to add new
clustering algorithms.
I am certainly interested in mentoring you for GSoC, but there may
also be others here who can help with that.
Just let us know what you want to do early on so that we can prepare
ourselves.
Thomas

