Return-Path: X-Original-To: apmail-commons-user-archive@www.apache.org Delivered-To: apmail-commons-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4C6F4109B8 for ; Wed, 13 Nov 2013 13:51:32 +0000 (UTC) Received: (qmail 62881 invoked by uid 500); 13 Nov 2013 13:51:29 -0000 Delivered-To: apmail-commons-user-archive@commons.apache.org Received: (qmail 62808 invoked by uid 500); 13 Nov 2013 13:51:28 -0000 Mailing-List: contact user-help@commons.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: "Commons Users List" Delivered-To: mailing list user@commons.apache.org Received: (qmail 62798 invoked by uid 99); 13 Nov 2013 13:51:27 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 13 Nov 2013 13:51:27 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of thomas.neidhart@gmail.com designates 209.85.216.42 as permitted sender) Received: from [209.85.216.42] (HELO mail-qa0-f42.google.com) (209.85.216.42) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 13 Nov 2013 13:51:20 +0000 Received: by mail-qa0-f42.google.com with SMTP id ii20so3738404qab.8 for ; Wed, 13 Nov 2013 05:50:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=Ref6SPNytvo5IrR0Fzb9wpPTpeTW01AiVUejjmZd2hI=; b=DNvOQJE0G461J8CJy09HeP4T8rNY82HEdMAhBEMFTzOHjMHpqUBvuzznbinxDieu7q t336wPZH3NOz9KNDdt8INWC5hjkhL2LyqCGwAknKKuxQ62JNAJ4/p4wr3i58rkqE2Iy5 R/MvGofDudhmIOGfGpvqOZ0kucW9HkY+dmfEp4pwOdjUH9aKKJh5XvJdBkC+JQeYSR7Z qVAoD3qlK5SL0eLbZyWjlLDXXh/AYQkyk/Z0eAUHJxMFwHxams2o8pnLjEAasm6Afj6i i/C5AB6MS5bUCDYX8KBEIODctt8WX6CCxmtxEjbXg3Ly/NKZvDRcz5Zr9f0FvOdsivna 0Znw== MIME-Version: 1.0 X-Received: by 10.49.110.9 with SMTP id hw9mr65924630qeb.4.1384350659442; Wed, 13 Nov 2013 05:50:59 -0800 (PST) Received: by 10.140.85.149 with HTTP; Wed, 13 Nov 2013 05:50:59 -0800 (PST) In-Reply-To: References: Date: Wed, 13 Nov 2013 14:50:59 +0100 Message-ID: Subject: Re: [MATH] Restricted hierarchical clustering From: Thomas Neidhart To: Commons Users List Content-Type: multipart/alternative; boundary=047d7bdc7b984d10da04eb0f41f0 X-Virus-Checked: Checked by ClamAV on apache.org --047d7bdc7b984d10da04eb0f41f0 Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable Hi Thorsten, this sounds like a very specific use-case of a hierarchical clustering. I could imagine the following way to achieve it: * first cluster all data points with kmeans, with a k=3D50 as you would li= ke to have 50 clusters on level 2 * take the 50 clusters and feed them into a HAC like algorithm which finishes when the number of level 1 clusters have been formed ( =3D 10 ) I did experiment with this, see the code below. The test is very simple, I create sequences from the normal ascii alphabet, AA .. ZZ For the distance measure, I created a special variant of the Euclidean distance by taking the position of the sequence into account, e.g. AA is more similar to AB than to BA. The result of kmeans is directly used to bootstrap the HAC clustering. Is this something you had in mind? public static class Sequence implements Clusterable { private char[] sequence; public Sequence(char[] seq) { sequence =3D seq; } public double[] getPoint() { double[] point =3D new double[ sequence.length ]; for ( int i =3D 0; i < point.length; i++ ) { point[i] =3D sequence[i] - 'A'; } return point; } @Override public String toString() { return String.valueOf(sequence); } } public static class SequenceDistance implements DistanceMeasure { public double compute(double[] a, double[] b) { double dist =3D 0; for ( int i =3D 0, j =3D 2; i < a.length; i++ ) { double diff =3D a[i] - b[i]; dist +=3D FastMath.pow( 10, j ) * diff * diff; --j; if ( j < 1 ) { j =3D 1; } } return FastMath.sqrt(dist); } } public static void main(String[] args) { List points =3D new ArrayList(); for ( char a =3D 'A'; a <=3D 'Z' ; a++ ) { for ( char b =3D 'A'; b <=3D 'Z'; b++ ) { points.add(new Sequence( new char[] { a, b } ) ); } } KMeansPlusPlusClusterer kmeans =3D new KMeansPlusPlusClusterer(50, 100, new SequenceDistance()); List> cluster =3D kmeans.cluster(points= ); Set> currentClusters =3D new HashSet>(cluster); while (currentClusters.size() > 10 ) { Cluster a =3D null; Cluster b =3D null; double minDistance =3D Double.MAX_VALUE; int i =3D 0; for (Cluster clusterA : currentClusters) { int j =3D 0; for (Cluster clusterB : currentClusters) { if (j++ <=3D i) { continue; } double distance =3D maxDistance(clusterA, clusterB, kmeans.getDistanceMeasure()); if (distance < minDistance) { a =3D clusterA; b =3D clusterB; minDistance =3D distance; } } i++; } currentClusters.remove(a); currentClusters.remove(b); Cluster merge =3D new HierarchicalCluster(minDistance, a, b); currentClusters.add(merge); } for ( Cluster c : currentClusters ) { System.out.println(c.getPoints()); printLeafs(c); System.out.println("---"); } } public static double maxDistance(Cluster a, Cluster b, DistanceMeasure dm) { double maxDistance =3D Double.MIN_VALUE; for (final T pA : a.getPoints()) { for (final T pB : b.getPoints()) { double d =3D dm.compute(pA.getPoint(), pB.getPoint()); if (d > maxDistance) { maxDistance =3D d; } } } return maxDistance; } public static void printLeafs(Cluster cluster) { if ( cluster instanceof HierarchicalCluster) { HierarchicalCluster hc =3D (HierarchicalCluster) cluster; //System.out.println("Cluster: distance=3D" + hc.getDistance() = + " " + cluster.getPoints()); if (hc.getLeftChild() !=3D null) { printLeafs(hc.getLeftChild()); } if (hc.getRightChild() !=3D null) { printLeafs(hc.getRightChild()); } } else { System.out.println("ClusterLeaf: " + cluster.getPoints()); } } On Tue, Nov 12, 2013 at 11:58 PM, email@thorstenschaefer.de < email@thorstenschaefer.de> wrote: > I saw Thomas=92 patch in https://issues.apache.org/jira/browse/MATH-959wh= ich aims to add support for HAC to commons-math. However, I am just faced > with a use case and wonder if/how this could be done either with existing > methods or the proposed HAC algorithm there. > > Lets assume we have items to 1000 cluster. Each item represents a > sequence, e.g. AB, AC, AD, =85, BA, BB, BC, =85, ZA, =85, ZZ and I can as= sign > data points to each item which can be used to calculate their > similarity/distance. My goal is to create 50 clusters containing all > sequences =96 this can be done pretty straight forward using KMeans++. > However, lets assume we want a hierarchical cluster, with 10 clusters at > level 1 and 50 at level 2. At level one, I have the restriction that the > first element in the sequence needs to be assigned to a unique cluster, > e.g., the structure should look something like this: > Cluster1: A, B, C > Cluster1.1: AA, AC, AD, AE, =85, BD, BE, BF, =85 CA > Cluster1.2: AB,BA,BC,BD,CB,CC,CD, =85 > =85 > Cluster1.7: AY,AZ,BZ,CZ > // cluster 1 has 7 subclusters. > Cluster2: D, E, F > =85 > Cluster3: G > Cluster3.1: GA,GB =85, GU > Cluster3.2: GV, GW,=85 GZ > // note that cluster 3 has only 2 sub clusters > Cluster4: H, I > =85 > Cluster 10: W, X, Y, Z > // all sub clusters from cluster1 to cluster10 should add up to 50 > > Hence, all sequences in a cluster in level 2 need to have its sequence > prefix in the parent cluster. Furthermore, even though I want 10 clusters > on level 1 and 50 on level 2, it does not mean that each level 1 cluster > should necessarily have 5 child clusters. > > I hope its clear enough to get the general restriction I want to ensure > and I wonder how this could be implemented using the clustering algorithm= s > in commons-math. > > Cheers, > > Thorsten > --047d7bdc7b984d10da04eb0f41f0--