commons-user mailing list archives

From "" <>
Subject [MATH] Restricted hierarchical clustering
Date Tue, 12 Nov 2013 22:58:13 GMT
I saw Thomas’ patch, which aims to add support for HAC (hierarchical agglomerative
clustering) to commons-math. However, I am currently faced with a use case and wonder if/how
it could be handled either with the existing methods or with the proposed HAC algorithm.

Let's assume we have 1000 items to cluster. Each item represents a sequence, e.g. AB, AC, AD,
…, BA, BB, BC, …, ZA, …, ZZ, and I can assign data points to each item which can be
used to calculate their similarity/distance. My goal is to create 50 clusters containing all
sequences; this can be done in a pretty straightforward way using KMeans++.
However, let's assume we want a hierarchical clustering, with 10 clusters at level 1 and 50 at
level 2. At level 1, I have the restriction that each first element of a sequence needs
to be assigned to exactly one cluster, e.g., the structure should look something like this:
Cluster1: A, B, C
Cluster1.1: AA, AC, AD, AE, …, BD, BE, BF, …, CA
Cluster1.2: AB, BA, BC, BD, CB, CC, CD, …
Cluster1.7: AY, AZ, BZ, CZ
// cluster 1 has 7 subclusters.
Cluster2: D, E, F
Cluster3: G
Cluster3.1: GA, GB, …, GU
Cluster3.2: GV, GW, …, GZ
// note that cluster 3 has only 2 subclusters
Cluster4: H, I
Cluster10: W, X, Y, Z
// all subclusters from cluster1 to cluster10 should add up to 50

Hence, every sequence in a level 2 cluster must have its prefix (its first element) in the
parent cluster. Furthermore, even though I want 10 clusters on level 1 and 50 on level 2, this
does not mean that each level 1 cluster should necessarily have 5 child clusters.
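
To make the restriction concrete, here is a minimal sketch of the bookkeeping I have in mind. It is not part of commons-math: the class RestrictedHierarchy and its methods groupByParent and allocate are hypothetical names, and the sketch assumes the level 1 assignment of prefixes to clusters is already known. It groups the sequences by the parent cluster of their prefix and then splits the level 2 budget across the parents (one subcluster each, the rest in proportion to group size). Proportional allocation is only one possible choice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not a commons-math class. */
final class RestrictedHierarchy {

    /**
     * Groups sequences by level 1 cluster. The parent is looked up from the
     * first character only, so every sequence lands under its prefix's cluster.
     */
    static Map<Integer, List<String>> groupByParent(List<String> sequences,
                                                    Map<Character, Integer> prefixToParent) {
        Map<Integer, List<String>> groups = new LinkedHashMap<>();
        for (String s : sequences) {
            Integer parent = prefixToParent.get(s.charAt(0));
            if (parent == null) {
                throw new IllegalArgumentException("No level 1 cluster for prefix of " + s);
            }
            groups.computeIfAbsent(parent, k -> new ArrayList<>()).add(s);
        }
        return groups;
    }

    /**
     * Splits a budget of subclusters across the parents: one per parent, and
     * the remainder in proportion to group size (largest-remainder rounding).
     * Assumption: no cap is applied, so a tiny group could in principle be
     * given more subclusters than it has items.
     */
    static int[] allocate(int[] sizes, int total) {
        int n = sizes.length;
        if (total < n) {
            throw new IllegalArgumentException("Need at least one subcluster per parent");
        }
        long sum = 0;
        for (int s : sizes) {
            sum += s;
        }
        if (sum <= 0) {
            throw new IllegalArgumentException("Groups must not all be empty");
        }
        int remaining = total - n;
        int[] k = new int[n];
        double[] frac = new double[n];
        int assigned = 0;
        for (int i = 0; i < n; i++) {
            double quota = (double) remaining * sizes[i] / sum;
            int whole = (int) Math.floor(quota);
            k[i] = 1 + whole;
            frac[i] = quota - whole;
            assigned += whole;
        }
        // Hand out what is left to the largest fractional remainders.
        for (int left = remaining - assigned; left > 0; left--) {
            int best = 0;
            for (int i = 1; i < n; i++) {
                if (frac[i] > frac[best]) {
                    best = i;
                }
            }
            k[best]++;
            frac[best] = -1.0;
        }
        return k;
    }
}
```

Each group could then be clustered on its own with its allocated count, for example with KMeans++, which keeps the prefix restriction by construction. How to obtain the level 1 assignment itself is the part I am unsure about.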

I hope it's clear enough to convey the general restriction I want to enforce, and I wonder how
this could be implemented using the clustering algorithms in commons-math.


