mahout-dev mailing list archives

From Álvaro Pérez Alarcón (JIRA) <>
Subject [jira] [Commented] (MAHOUT-1344) Self-Organizing Map algorithm (batch version)
Date Mon, 07 Oct 2013 10:05:42 GMT


Álvaro Pérez Alarcón commented on MAHOUT-1344:

I agree that engaging with the community from the beginning would have produced better results.
I didn't because this was my final-year project, and involving the community would have made it
depend on others rather than just me, so I chose to wait until I actually had something to provide.
This was probably unwise, since engaging earlier would have been easier for everyone. That's also
why I used the stable releases rather than snapshots or development versions.

I used the old clustering framework because when I started developing the algorithm the current
version was 0.7, and when 0.8 was released, I just ported the code. I wouldn't mind taking
a look at the new clustering framework and seeing if I can provide an implementation of the SOM
based on it. It shouldn't be hard now that I'm familiar with Mahout's code. Where can I find it,
though? I've been looking at the SVN repository and I can't find anything that looks like a
clustering framework different from the one I've used, which has no changes newer than the 0.8
release except for one class that wouldn't affect my patch.

I do have the project's writeup, but it's in Spanish. I don't have a tutorial, although I
could write one. What information should it contain? Format of the input files, usage of the
driver and command line arguments, format and names of the output files... anything else?

As for the algorithm itself, the SOM algorithm is a clustering algorithm whose results can
easily be visualized graphically. Since the clusters are ordered on a map, neighboring clusters
will have similar centers, so similar input data is associated with clusters from the same
region of the map. This represents a high-dimensional input space in a low-dimensional
space (for instance, a 2D matrix). This way, big datasets can be easily analyzed by an expert
using graphical tools. The batch version was implemented because it is faster than the classic
version, its behavior is deterministic for a given cluster initialization, and it is simpler
to parallelize than the classic version.
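
To make the batch update concrete, here is a minimal single-machine sketch in plain Python. It
is not the code from the patch: the function names, the Gaussian neighborhood, and the
exponentially shrinking radius are all assumptions chosen for illustration. Each iteration
assigns every input to its closest unit, then replaces every center with the
neighborhood-weighted mean of all inputs. The numerator and denominator are plain sums over
the inputs, which is why the step can be computed per data partition and then combined.

```python
import math

def grid_coords(rows, cols):
    # Map coordinates of each unit, in row-major order (hypothetical helper).
    return [(r, c) for r in range(rows) for c in range(cols)]

def best_matching_unit(x, weights):
    # Index of the unit whose center is closest to x (squared Euclidean distance).
    return min(range(len(weights)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, weights[j])))

def batch_som_step(data, weights, coords, sigma):
    # One batch iteration: accumulate neighborhood-weighted sums, then divide.
    dim = len(weights[0])
    num = [[0.0] * dim for _ in weights]   # weighted sum of inputs per unit
    den = [0.0] * len(weights)             # sum of neighborhood weights per unit
    for x in data:
        br, bc = coords[best_matching_unit(x, weights)]
        for j, (r, c) in enumerate(coords):
            # Gaussian neighborhood on the map grid (an assumed choice).
            h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2.0 * sigma ** 2))
            den[j] += h
            for k in range(dim):
                num[j][k] += h * x[k]
    # Keep the old center if a unit received no weight at all.
    return [[n / den[j] for n in num[j]] if den[j] > 0 else list(weights[j])
            for j in range(len(weights))]

def train_batch_som(data, weights, rows, cols, iterations,
                    sigma_start=1.0, sigma_end=0.3):
    # The radius shrinks exponentially from sigma_start to sigma_end (assumed schedule).
    coords = grid_coords(rows, cols)
    for t in range(iterations):
        frac = t / max(iterations - 1, 1)
        sigma = sigma_start * (sigma_end / sigma_start) ** frac
        weights = batch_som_step(data, weights, coords, sigma)
    return weights
```

Since there is no random sampling and no dependence on input order, two runs from the same
initial centers give the same map, which is the determinism mentioned above.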

This algorithm is used, among other things, in scientific research. For instance, one of the
motivations of the project was the analysis of the observations that will be obtained by the
ESA Gaia mission, whose purpose is to compile a large census of celestial bodies, a large
number of which are expected to fail classification by supervised methods (and therefore
unsupervised classification is needed). My project supervisor works in a research group
that is designing the analysis of those objects, and they're using the SOM algorithm.

I think this algorithm would fit well into Mahout, since it's a widely known clustering algorithm
with useful results, but it's not up to me to decide that, of course. Should I take this discussion
to the mailing list? As I said earlier, I'm willing to provide an implementation on the new
clustering framework if it's within my ability. I'm also willing to stick around and help
with any issues related to my implementation.

> Self-Organizing Map algorithm (batch version)
> ---------------------------------------------
>                 Key: MAHOUT-1344
>                 URL:
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Álvaro Pérez Alarcón
>            Priority: Minor
>             Fix For: 0.8
>         Attachments: MAHOUT-1344.patch
> Good morning.
> As part of my final year project, I have implemented a new module for Apache Mahout,
implementing Kohonen's self-organizing map algorithm, in its batch version.
> The work is already done, and I will proceed to submit a patch ASAP. It was developed
over Mahout 0.8.
> The patch includes unit tests and the algorithm was successfully used in a Hadoop cluster
to cluster two big datasets. Results can be seen in [this image gallery|].
> The implementation uses the generic clustering algorithms implemented in the ClusterIterator
class. Minor changes were made to this and other related classes to support some of the features,
without affecting the execution of other algorithms.
> The algorithm supports convergence and the ability to resume a run at a given iteration
(mainly, in order to initialize KohonenBatchClusteringPolicy with a given iteration number,
although it also affects the names of the output directories).

This message was sent by Atlassian JIRA
