mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: [jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
Date Mon, 27 Sep 2010 17:23:43 GMT
  Hi Shannon,

This is very interesting stuff. I was able to download your latest patch 
and, with a half an hour of busy-work editing, was able to make it all 
compile and all the unit tests run. I will attach my revised patch for 
your inspection. Hope this helps.

Cheers,
Jeff


On 9/27/10 10:31 AM, Shannon Quinn wrote:
> It is (mostly) completed as it stands, but a few updates that have been
> pushed out to some of the dependencies (DistributedRowMatrix and
> DistributedLanczosSolver in particular) have broken the patch I had, so I
> have another that's just about ready to commit with these fixes.
>
> Had a big machine learning assignment that was due today, so hopefully I
> should be able to submit this patch this week.
>
> Shannon
>
> On Sun, Sep 26, 2010 at 5:59 AM, Sean Owen (JIRA)<jira@apache.org>  wrote:
>
>>      [
>> https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>
>> Sean Owen updated MAHOUT-363:
>> -----------------------------
>>
>>         Fix Version/s: 0.5
>>     Affects Version/s: 0.3
>>           Component/s: Clustering
>>
>> Shannon was this completed? Do we have a final patch to be committed?
>>
>>> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
>>> ------------------------------------------------------------------
>>>
>>>                  Key: MAHOUT-363
>>>                  URL: https://issues.apache.org/jira/browse/MAHOUT-363
>>>              Project: Mahout
>>>           Issue Type: Task
>>>           Components: Clustering
>>>     Affects Versions: 0.3
>>>             Reporter: Shannon Quinn
>>>             Assignee: Shannon Quinn
>>>              Fix For: 0.5
>>>
>>>          Attachments: MAHOUT-363.patch, MAHOUT-363.patch,
>> MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch,
>> MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch,
>> MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch,
>> MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch,
>> MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch
>>>
>>> Proposal Title: EigenCuts spectral clustering implementation on
>> map/reduce for Apache Mahout (addresses issue Mahout-328)
>>> Student Name: Shannon Quinn
>>> Student E-mail: magsol@gmail.com
>>> Organization/Project:Assigned Mentor:
>>> Proposal Abstract:
>>> Clustering algorithms are advantageous when the number of classes are not
>> known a priori. However, most techniques still require an explicit K to be
>> chosen, and most spectral algorithms' use of piecewise constant
>> approximation of eigenvectors breaks down when the clusters are tightly
>> coupled. EigenCuts[1] solves both these problems by choosing an eigenvector
>> to create a new cluster boundary and iterating until no more edges are cut.
>>> Detailed Description
>>> Clustering techniques are extremely useful unsupervised methods,
>> particularly within my field of computational biology, for situations where
>> the number (and often the characteristics as well) of classes expressed in
>> the data are not known a priori. K-means is a classic technique which, given
>> some K, attempts to label data points within a cluster as a function of
>> their distance (e.g. Euclidean) from the cluster's mean, iterating to
>> convergence.
>>> Another approach is spectral clustering, which models the data as a
>> weighted, undirected graph in some n-dimensional space, and creates a matrix
>> M of transition probabilities between nodes. By computing the eigenvalues
>> and eigenvectors of this matrix, most spectral clustering techniques take
>> advantage of the fact that, for data with loosely coupled clusters, the K
>> leading eigenvectors will identify the roughly piecewise constant regions in
>> the data that correspond to clusters.
>>> However, these techniques all suffer from drawbacks, the two most
>> significant of which are having to choose an arbitrary K a priori, and in
>> the situation of tightly coupled clusters where the piecewise constant
>> approximation on the eigenvectors no longer holds.
>>> The EigenCuts algorithm addresses both these issues. As a type of
>> spectral clustering algorithm it works by constructing a Markov chain
>> representation of the data and computing the eigenvectors and eigenvalues of
>> the transition matrix. Eigenflows, or flow of probability by eigenvector,
>> have an associated half life of flow decay called eigenflow. By perturbing
>> the weights between nodes, it can be observed where bottlenecks exist in the
>> eigenflow's halflife, allowing for the identification of boundaries between
>> clusters. Thus, this algorithm iterates until no more cuts between clusters
>> need to be made, eliminating the need for an a prior K, and conferring the
>> ability to separate tightly coupled clusters.
>>> The only disadvantage of EigenCuts is the need to recompute eigenvectors
>> and eigenvalues at each iterative step, incurring a large computational
>> overhead. This problem can be adequately addressed within the map/reduce
>> framework and on a Hadoop cluster by parallelizing the computation of each
>> eigenvector and its associated eigenvalue. Apache Hama in particular, with
>> its specializations in graph and matrix data, will be crucial in
>> parallelizing the computations of transition matrices and their
>> corresponding eigenvalues and eigenvectors at each iteration.
>>> Since Dr Chennubhotla is currently a member of the faculty at the
>> University of Pittsburgh, I have been in contact with him for the past few
>> weeks, and we both envision and eagerly anticipate continued collaboration
>> over the course of the summer and this project's implementation. His
>> guidance in highlighting the finer points of the underlying theory, coupled
>> with my experience in and knowledge of software engineering, makes this is a
>> project we are both extremely excited about implementing.
>>> Timeline
>>> At the end of each sprint, there should be a concrete, functional
>> deliverable. It may not do much, but what it does will work and have full
>> coverage accompanying unit tests.
>>> Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks -
>> familiarization with and dev deployments of Hadoop and Mahout; reading up on
>> documentation, fine-tuning the project plan and requirements. This part will
>> kick into high gear after May 6 (my last final exam and final academic
>> obligation, prior to the actual graduation ceremony), but likely nothing
>> before April 29 (the day of my thesis defense).
>>> Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering
>> algorithm and parallelize on Mahout over Hadoop. Preliminary interface
>> allows for dataset selection and visualization of resulting clusters.
>>> Sprint 2 (3 weeks; June 7 - 27): Modify k-means algorithm to spectral
>> clustering. Integrate map/reduce framework via Mahout and take advantage of
>> existing core calculation of transition matrices and associated eigenvectors
>> and eigenvalues.
>>> Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to
>> EigenCuts. Fully parallelize with Mahout. Also, mid-term evaluations.
>>> Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with
>> EigenCuts. Finalize interface for running the algorithm, selecting datasets
>> and visualizing results.
>>> Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit
>> tests, fix outstanding bugs.
>>> Other Information
>>> I am finishing up my last semester as a master's student in computational
>> biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at
>> Carnegie Mellon this coming fall. I have worked extensively with clustering
>> techniques as a master's student, as a significant amount of my work has
>> involved bioimage analysis within Dr Robert Murphy's lab. Previous work has
>> involved using HMMs to detect patterns in tuberculosis genomes and use
>> CLUSTALW to cluster those patterns according to geographic location. My
>> master's thesis involves use of matrix completion to infer linear
>> transformations of proteins and their associated subcellular locations
>> across varying cell conditions (drugs, cell lines, etc).
>>> I unfortunately have limited experience with Apache Mahout and Hadoop,
>> but with an undergraduate computer science degree from Georgia Tech, and
>> after an internship with IBM ExtremeBlue, I feel I am extremely adept at
>> picking up new frameworks quickly.
>>> References
>>> [1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for
>> Spectral Clustering. NIPS 2002.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>


Mime
  • Unnamed multipart/mixed (inline, None, 0 bytes)
View raw message