[ https://issues.apache.org/jira/browse/MAHOUT363?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Shannon Quinn updated MAHOUT363:

Description:
Proposal Title: EigenCuts spectral clustering implementation on map/reduce for Apache Mahout
(addresses issue Mahout328)
Student Name: Shannon Quinn
Student Email: magsol@gmail.com
Organization/Project:Assigned Mentor:
Proposal Abstract:
Clustering algorithms are advantageous when the number of classes are not known a priori.
However, most techniques still require an explicit K to be chosen, and most spectral algorithms'
use of piecewise constant approximation of eigenvectors breaks down when the clusters are
tightly coupled. EigenCuts[1] solves both these problems by choosing an eigenvector to create
a new cluster boundary and iterating until no more edges are cut.
Detailed Description
Clustering techniques are extremely useful unsupervised methods, particularly within my field
of computational biology, for situations where the number (and often the characteristics as
well) of classes expressed in the data are not known a priori. Kmeans is a classic technique
which, given some K, attempts to label data points within a cluster as a function of their
distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
Another approach is spectral clustering, which models the data as a weighted, undirected graph
in some ndimensional space, and creates a matrix M of transition probabilities between nodes.
By computing the eigenvalues and eigenvectors of this matrix, most spectral clustering techniques
take advantage of the fact that, for data with loosely coupled clusters, the K leading eigenvectors
will identify the roughly piecewise constant regions in the data that correspond to clusters.
However, these techniques all suffer from drawbacks, the two most significant of which are
having to choose an arbitrary K a priori, and in the situation of tightly coupled clusters
where the piecewise constant approximation on the eigenvectors no longer holds.
The EigenCuts algorithm addresses both these issues. As a type of spectral clustering algorithm
it works by constructing a Markov chain representation of the data and computing the eigenvectors
and eigenvalues of the transition matrix. Eigenflows, or flow of probability by eigenvector,
have an associated half life of flow decay called eigenflow. By perturbing the weights between
nodes, it can be observed where bottlenecks exist in the eigenflow's halflife, allowing for
the identification of boundaries between clusters. Thus, this algorithm iterates until no
more cuts between clusters need to be made, eliminating the need for an a prior K, and conferring
the ability to separate tightly coupled clusters.
The only disadvantage of EigenCuts is the need to recompute eigenvectors and eigenvalues at
each iterative step, incurring a large computational overhead. This problem can be adequately
addressed within the map/reduce framework and on a Hadoop cluster by parallelizing the computation
of each eigenvector and its associated eigenvalue. Apache Hama in particular, with its specializations
in graph and matrix data, will be crucial in parallelizing the computations of transition
matrices and their corresponding eigenvalues and eigenvectors at each iteration.
Since Dr Chennubhotla is currently a member of the faculty at the University of Pittsburgh,
I have been in contact with him for the past few weeks, and we both envision and eagerly anticipate
continued collaboration over the course of the summer and this project's implementation. He
has thus far been instrumental in highlighting the finer points of the underlying theory,
and coupled with my experience in and knowledge of software engineering, this is a project
we are both extremely excited about implementing.
Timeline
At the end of each sprint, there should be a concrete, functional deliverable. It may not
do much, but what it does will work and have full coverage accompanying unit tests.
Sprint 0 (April 26  May 23): Work with mentor on any precoding tasks  familiarization with
and dev deployments of Hadoop and Mahout; reading up on documentation, finetuning the project
plan and requirements. This part will kick into high gear after May 6 (my last final exam
and final academic obligation, prior to the actual graduation ceremony), but likely nothing
before April 29 (the day of my thesis defense).
Sprint 1 (2 weeks; May 24  June 6): Implement basic kmeans clustering algorithm and parallelize
on Mahout over Hadoop. Preliminary interface allows for dataset selection and visualization
of resulting clusters.
Sprint 2 (3 weeks; June 7  27): Modify kmeans algorithm to spectral clustering. Integrate
map/reduce framework via Mahout and take advantage of existing core calculation of transition
matrices and associated eigenvectors and eigenvalues.
Sprint 3 (3 weeks; June 28  July 18): Augment spectral clustering to EigenCuts. Fully parallelize
with Mahout. Also, midterm evaluations.
Sprint 4 (3 weeks; July 19  August 8): Fix any remaining issues with EigenCuts. Finalize
interface for running the algorithm, selecting datasets and visualizing results.
Sprint 5 (1 week; August 9  15): Tidy up documentation, write final unit tests, fix outstanding
bugs.
Other Information
I am finishing up my last semester as a master's student in computational biology at Carnegie
Mellon, prior to beginning the PhD program in CompBio at Carnegie Mellon this coming fall.
I have worked extensively with clustering techniques as a master's student, as a significant
amount of my work has involved bioimage analysis within Dr Robert Murphy's lab. Previous work
has involved using HMMs to detect patterns in tuberculosis genomes and use CLUSTALW to cluster
those patterns according to geographic location. My master's thesis involves use of matrix
completion to infer linear transformations of proteins and their associated subcellular locations
across varying cell conditions (drugs, cell lines, etc).
I unfortunately have limited experience with Apache Mahout and Hadoop, but with an undergraduate
computer science degree from Georgia Tech, and after an internship with IBM ExtremeBlue, I
feel I am extremely adept at picking up new frameworks quickly.
References
[1] Chakra Chennubhotla and Allan D. Jepson. HalfLives of EigenFlows for Spectral Clustering.
NIPS 2002.
was:
Proposal Title: EigenCuts spectral clustering implementation on map/reduce for Apache Mahout
(addresses issue Mahout328)
Student Name: Shannon Quinn
Student Email: magsol@gmail.com
Organization/Project:Assigned Mentor:
Proposal Abstract:
Clustering algorithms are advantageous when the number of classes are not known a priori.
However, most techniques still require an explicit K to be chosen, and most spectral algorithms'
use of piecewise constant approximation of eigenvectors breaks down when the clusters are
tightly coupled. EigenCuts[1] solves both these problems by choosing an eigenvector to create
a new cluster boundary and iterating until no more edges are cut.
Detailed Description
Clustering techniques are extremely useful unsupervised methods, particularly within my field
of computational biology, for situations where the number (and often the characteristics as
well) of classes expressed in the data are not known a priori. Kmeans is a classic technique
which, given some K, attempts to label data points within a cluster as a function of their
distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
Another approach is spectral clustering, which models the data as a weighted, undirected graph
in some ndimensional space, and creates a matrix M of transition probabilities between nodes.
By computing the eigenvalues and eigenvectors of this matrix, most spectral clustering techniques
take advantage of the fact that, for data with loosely coupled clusters, the K leading eigenvectors
will identify the roughly piecewise constant regions in the data that correspond to clusters.
However, these techniques all suffer from drawbacks, the two most significant of which are
having to choose an arbitrary K a priori, and in the situation of tightly coupled clusters
where the piecewise constant approximation on the eigenvectors no longer holds.
The EigenCuts algorithm addresses both these issues. As a type of spectral clustering algorithm
it works by constructing a Markov chain representation of the data and computing the eigenvectors
and eigenvalues of the transition matrix. Eigenflows, or flow of probability by eigenvector,
have an associated half life of flow decay called eigenflow. By perturbing the weights between
nodes, it can be observed where bottlenecks exist in the eigenflow's halflife, allowing for
the identification of boundaries between clusters. Thus, this algorithm iterates until no
more cuts between clusters need to be made, eliminating the need for an a prior K, and conferring
the ability to separate tightly coupled clusters.
The only disadvantage of EigenCuts is the need to recompute eigenvectors and eigenvalues at
each iterative step, incurring a large computational overhead. This problem can be adequately
addressed within the map/reduce framework and on a Hadoop cluster by parallelizing the computation
of each eigenvector and its associated eigenvalue. Apache Hama in particular, with its specializations
in graph and matrix data, will be crucial in parallelizing the computations of transition
matrices and their corresponding eigenvalues and eigenvectors at each iteration.
Since Dr Chennubhotla is currently a member of the faculty at the University of Pittsburgh,
I have been in contact with him for the past few weeks, and we both envision and eagerly anticipate
continued collaboration over the course of the summer and this project's implementation. He
has thus far been instrumental in highlighting the finer points of the underlying theory,
and coupled with my experience in and knowledge of software engineering, this is a project
we are both extremely excited about implementing.
Timeline
At the end of each sprint, there should be a concrete, functional deliverable. It may not
do much, but what it does will work and have full coverage accompanying unit tests.
Sprint 0 (April 26  May 23): Work with mentor on any precoding tasks  familiarization with
and dev deployments of Hadoop, Mahout, Hama; reading up on documentation, finetuning the
project plan and requirements. This part will kick into high gear after May 6 (my last final
exam and final academic obligation, prior to the actual graduation ceremony), but likely nothing
before April 29 (the day of my thesis defense).
Sprint 1 (2 weeks; May 24  June 6): Implement basic kmeans clustering algorithm and parallelize
on Mahout over Hadoop. Preliminary interface allows for dataset selection and visualization
of resulting clusters.
Sprint 2 (3 weeks; June 7  27): Modify kmeans algorithm to spectral clustering. Integrate
map/reduce framework via Hama for calculating of transition matrices and associated eigenvectors
and eigenvalues.
Sprint 3 (3 weeks; June 28  July 18): Augment spectral clustering to EigenCuts. Fully parallelize
with Hama. Also, midterm evaluations.
Sprint 4 (3 weeks; July 19  August 8): Fix any remaining issues with EigenCuts. Finalize
interface for running the algorithm, selecting datasets and visualizing results.
Sprint 5 (1 week; August 9  15): Tidy up documentation, write final unit tests, fix outstanding
bugs.
Other Information
I am finishing up my last semester as a master's student in computational biology at Carnegie
Mellon, prior to beginning the PhD program in CompBio at Carnegie Mellon this coming fall.
I have worked extensively with clustering techniques as a master's student, as a significant
amount of my work has involved bioimage analysis within Dr Robert Murphy's lab. Previous work
has involved using HMMs to detect patterns in tuberculosis genomes and use CLUSTALW to cluster
those patterns according to geographic location. My master's thesis involves use of matrix
completion to infer linear transformations of proteins and their associated subcellular locations
across varying cell conditions (drugs, cell lines, etc).
I unfortunately have limited experience with Apache Mahout and Hadoop, but with an undergraduate
computer science degree from Georgia Tech, and after an internship with IBM ExtremeBlue, I
feel I am extremely adept at picking up new frameworks quickly.
References
[1] Chakra Chennubhotla and Allan D. Jepson. HalfLives of EigenFlows for Spectral Clustering.
NIPS 2002.
> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> 
>
> Key: MAHOUT363
> URL: https://issues.apache.org/jira/browse/MAHOUT363
> Project: Mahout
> Issue Type: Task
> Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce for Apache
Mahout (addresses issue Mahout328)
> Student Name: Shannon Quinn
> Student Email: magsol@gmail.com
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not known a priori.
However, most techniques still require an explicit K to be chosen, and most spectral algorithms'
use of piecewise constant approximation of eigenvectors breaks down when the clusters are
tightly coupled. EigenCuts[1] solves both these problems by choosing an eigenvector to create
a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly within
my field of computational biology, for situations where the number (and often the characteristics
as well) of classes expressed in the data are not known a priori. Kmeans is a classic technique
which, given some K, attempts to label data points within a cluster as a function of their
distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, undirected
graph in some ndimensional space, and creates a matrix M of transition probabilities between
nodes. By computing the eigenvalues and eigenvectors of this matrix, most spectral clustering
techniques take advantage of the fact that, for data with loosely coupled clusters, the K
leading eigenvectors will identify the roughly piecewise constant regions in the data that
correspond to clusters.
> However, these techniques all suffer from drawbacks, the two most significant of which
are having to choose an arbitrary K a priori, and in the situation of tightly coupled clusters
where the piecewise constant approximation on the eigenvectors no longer holds.
> The EigenCuts algorithm addresses both these issues. As a type of spectral clustering
algorithm it works by constructing a Markov chain representation of the data and computing
the eigenvectors and eigenvalues of the transition matrix. Eigenflows, or flow of probability
by eigenvector, have an associated half life of flow decay called eigenflow. By perturbing
the weights between nodes, it can be observed where bottlenecks exist in the eigenflow's halflife,
allowing for the identification of boundaries between clusters. Thus, this algorithm iterates
until no more cuts between clusters need to be made, eliminating the need for an a prior K,
and conferring the ability to separate tightly coupled clusters.
> The only disadvantage of EigenCuts is the need to recompute eigenvectors and eigenvalues
at each iterative step, incurring a large computational overhead. This problem can be adequately
addressed within the map/reduce framework and on a Hadoop cluster by parallelizing the computation
of each eigenvector and its associated eigenvalue. Apache Hama in particular, with its specializations
in graph and matrix data, will be crucial in parallelizing the computations of transition
matrices and their corresponding eigenvalues and eigenvectors at each iteration.
> Since Dr Chennubhotla is currently a member of the faculty at the University of Pittsburgh,
I have been in contact with him for the past few weeks, and we both envision and eagerly anticipate
continued collaboration over the course of the summer and this project's implementation. He
has thus far been instrumental in highlighting the finer points of the underlying theory,
and coupled with my experience in and knowledge of software engineering, this is a project
we are both extremely excited about implementing.
> Timeline
> At the end of each sprint, there should be a concrete, functional deliverable. It may
not do much, but what it does will work and have full coverage accompanying unit tests.
> Sprint 0 (April 26  May 23): Work with mentor on any precoding tasks  familiarization
with and dev deployments of Hadoop and Mahout; reading up on documentation, finetuning the
project plan and requirements. This part will kick into high gear after May 6 (my last final
exam and final academic obligation, prior to the actual graduation ceremony), but likely nothing
before April 29 (the day of my thesis defense).
> Sprint 1 (2 weeks; May 24  June 6): Implement basic kmeans clustering algorithm and
parallelize on Mahout over Hadoop. Preliminary interface allows for dataset selection and
visualization of resulting clusters.
> Sprint 2 (3 weeks; June 7  27): Modify kmeans algorithm to spectral clustering. Integrate
map/reduce framework via Mahout and take advantage of existing core calculation of transition
matrices and associated eigenvectors and eigenvalues.
> Sprint 3 (3 weeks; June 28  July 18): Augment spectral clustering to EigenCuts. Fully
parallelize with Mahout. Also, midterm evaluations.
> Sprint 4 (3 weeks; July 19  August 8): Fix any remaining issues with EigenCuts. Finalize
interface for running the algorithm, selecting datasets and visualizing results.
> Sprint 5 (1 week; August 9  15): Tidy up documentation, write final unit tests, fix
outstanding bugs.
> Other Information
> I am finishing up my last semester as a master's student in computational biology at
Carnegie Mellon, prior to beginning the PhD program in CompBio at Carnegie Mellon this coming
fall. I have worked extensively with clustering techniques as a master's student, as a significant
amount of my work has involved bioimage analysis within Dr Robert Murphy's lab. Previous work
has involved using HMMs to detect patterns in tuberculosis genomes and use CLUSTALW to cluster
those patterns according to geographic location. My master's thesis involves use of matrix
completion to infer linear transformations of proteins and their associated subcellular locations
across varying cell conditions (drugs, cell lines, etc).
> I unfortunately have limited experience with Apache Mahout and Hadoop, but with an undergraduate
computer science degree from Georgia Tech, and after an internship with IBM ExtremeBlue, I
feel I am extremely adept at picking up new frameworks quickly.
> References
> [1] Chakra Chennubhotla and Allan D. Jepson. HalfLives of EigenFlows for Spectral Clustering.
NIPS 2002.

This message is automatically generated by JIRA.

You can reply to this email to add a comment to the issue online.
