mahout-dev mailing list archives

From "Karl Wettin (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-19) Hierarchial clusterer
Date Mon, 14 Apr 2008 16:19:07 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588622#action_12588622
] 

Karl Wettin commented on MAHOUT-19:
-----------------------------------

This is the first real thing I've done with Hadoop. It would be great to get some input on how
I have used it. Pretend that DistributedBottomFeed is a driver class.

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchical clusterer the instances are the leaf nodes in a tree, where each branch node
contains the mean features of, and the distance between, its children.
> For performance reasons I have always trained trees top->down. I have been told
that this can cause various side effects, though I have never encountered any. And I believe Huffman
solved his problem by training bottom->up? The thing is, I don't think it is possible to train the
tree top->down using map reduce. I do, however, think it is possible to train it bottom->up. I would
very much appreciate any thoughts on this.
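
To make the bottom->up idea concrete, here is a minimal single-machine sketch, not the
attached patch and not a map reduce job. It repeatedly merges the two closest nodes into
a branch that stores the mean of its children and the distance between them. The names
BottomUpSketch, Node and train are hypothetical, and Euclidean distance on dense arrays is
an assumption made for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class BottomUpSketch {

    /** A leaf holds one instance; a branch holds the mean of, and distance between, its children. */
    public static final class Node {
        public final double[] mean;
        public final double distance; // distance between the two children, 0 for a leaf
        public final Node left;       // null for a leaf
        public final Node right;      // null for a leaf
        public final int size;        // number of leaves below this node

        Node(double[] instance) {
            this.mean = instance;
            this.distance = 0.0;
            this.left = null;
            this.right = null;
            this.size = 1;
        }

        Node(Node l, Node r, double d) {
            this.size = l.size + r.size;
            this.mean = new double[l.mean.length];
            for (int i = 0; i < mean.length; i++) {
                // Weight each child by its leaf count so the mean covers all instances below.
                mean[i] = (l.mean[i] * l.size + r.mean[i] * r.size) / size;
            }
            this.distance = d;
            this.left = l;
            this.right = r;
        }

        public boolean isLeaf() {
            return left == null;
        }
    }

    /** Assumed distance measure for this sketch only. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Naive agglomerative training: merge the closest pair until one root remains. */
    public static Node train(List<double[]> instances) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("need at least one instance");
        }
        List<Node> active = new ArrayList<>();
        for (double[] x : instances) {
            active.add(new Node(x));
        }
        while (active.size() > 1) {
            int bestI = 0, bestJ = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < active.size(); i++) {
                for (int j = i + 1; j < active.size(); j++) {
                    double d = euclidean(active.get(i).mean, active.get(j).mean);
                    if (d < best) {
                        best = d;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            Node r = active.remove(bestJ); // remove the higher index first
            Node l = active.remove(bestI);
            active.add(new Node(l, r, best));
        }
        return active.get(0);
    }
}
```

Each pass scans every pair, so this is only meant to show the shape of the tree. A
distributed version would need to split the pair search across mappers.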
> Once this tree is trained one can extract clusters in various ways. The mean distance
between all instances is usually a good maximum distance to allow between nodes when navigating
the tree in search of a cluster.
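
One possible reading of this extraction step, sketched under assumptions: start at a leaf,
walk up while the branch distance stays within the maximum, then gather every leaf below
the highest branch reached. The class ClusterExtractionSketch, its Node type and the
clusterOf method are hypothetical names for illustration and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

public class ClusterExtractionSketch {

    /** Minimal tree node with a parent link so we can navigate upwards from a leaf. */
    public static final class Node {
        public final String label;    // set for leaves only
        public final double distance; // distance between the children, 0 for a leaf
        public final Node left;
        public final Node right;
        public Node parent;

        public Node(String label) {
            this.label = label;
            this.distance = 0.0;
            this.left = null;
            this.right = null;
        }

        public Node(Node left, Node right, double distance) {
            this.label = null;
            this.distance = distance;
            this.left = left;
            this.right = right;
            left.parent = this;
            right.parent = this;
        }
    }

    /** Returns the labels of all leaves in the cluster around the given leaf. */
    public static List<String> clusterOf(Node leaf, double maxDistance) {
        Node top = leaf;
        // Climb while the next branch still joins children that are close enough.
        while (top.parent != null && top.parent.distance <= maxDistance) {
            top = top.parent;
        }
        List<String> labels = new ArrayList<>();
        collect(top, labels);
        return labels;
    }

    private static void collect(Node node, List<String> labels) {
        if (node.left == null) {
            labels.add(node.label);
            return;
        }
        collect(node.left, labels);
        collect(node.right, labels);
    }
}
```

Because maxDistance is just a parameter of the lookup, it can be changed per query without
retraining the tree.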
> Navigating the tree and gathering nodes that are not too far away from each other is usually
instant if the tree is available in memory or persisted in a smart way. In my experience there
is not much to gain from extracting all clusters from the start. Also, it usually makes sense to
allow the user to modify the cluster boundary variables in real time using a slider, or
perhaps to present the named summary of neighbouring clusters, blacklist paths in the tree, etc.
It is also not too bad to use secondary classification on the instances to create wormholes
in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for an instant "more like this" feature in search
engines, using Tanimoto similarity on the vector spaces to calculate the distance.
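
For reference, the Tanimoto similarity of two vectors a and b is a.b / (|a|^2 + |b|^2 - a.b).
Below is a small sketch for sparse term vectors. Representing a document as a
Map<String, Double>, using 1 - similarity as the distance, and treating two empty vectors
as identical are all assumptions of this example, and the class name Tanimoto is
hypothetical.

```java
import java.util.Map;

public class Tanimoto {

    /** Tanimoto similarity: dot / (|a|^2 + |b|^2 - dot). */
    public static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double v = e.getValue();
            normA += v * v;
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += v * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        double denominator = normA + normB - dot;
        // Assumption: two all-zero vectors count as identical.
        return denominator == 0.0 ? 1.0 : dot / denominator;
    }

    /** Distance derived from the similarity; 0 means identical vectors. */
    public static double distance(Map<String, Double> a, Map<String, Double> b) {
        return 1.0 - similarity(a, b);
    }
}
```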
> See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a hierarchical
clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

