Mailing-List: contact user-help@commons.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Commons Users List" <user@commons.apache.org>
Received-SPF: pass (nike.apache.org: domain of thomas.neidhart@gmail.com
 designates 209.85.216.42 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <CEA838CF.48F7%email@thorstenschaefer.de>
References: <CEA838CF.48F7%email@thorstenschaefer.de>
Date: Wed, 13 Nov 2013 14:50:59 +0100
Message-ID: 
 <CAJSjvwuM20e9hS3=wa0VMDANXY5hCDNQu0ud-oiE0tgSuyACwA@mail.gmail.com>
Subject: Re: [MATH] Restricted hierarchical clustering
From: Thomas Neidhart <thomas.neidhart@gmail.com>
To: Commons Users List <user@commons.apache.org>
Content-Type: multipart/alternative; boundary=047d7bdc7b984d10da04eb0f41f0

--047d7bdc7b984d10da04eb0f41f0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Hi Thorsten,

this sounds like a very specific use-case of a hierarchical clustering.
I could imagine the following way to achieve it:

 * first cluster all data points with kmeans, with a k=3D50 as you would li=
ke
to have 50 clusters on level 2
 * take the 50 clusters and feed them into a HAC like algorithm which
finishes when the number of level 1 clusters have been formed ( =3D 10 )

I did experiment with this, see the code below. The test is very simple, I
create sequences from the normal ascii alphabet, AA .. ZZ
For the distance measure, I created a special variant of the Euclidean
distance by taking the position of the sequence into account, e.g. AA is
more similar to AB than to BA.
The result of kmeans is directly used to bootstrap the HAC clustering.

Is this something you had in mind?


    public static class Sequence implements Clusterable {

        private char[] sequence;

        public Sequence(char[] seq) {
            sequence =3D seq;
        }

        public double[] getPoint() {
            double[] point =3D new double[ sequence.length ];
            for ( int i =3D 0; i < point.length; i++ ) {
                point[i] =3D sequence[i] - 'A';
            }
            return point;
        }

        @Override
        public String toString() {
            return String.valueOf(sequence);
        }
    }

    public static class SequenceDistance implements DistanceMeasure {

        public double compute(double[] a, double[] b) {
            double dist =3D 0;
            for ( int i =3D 0, j =3D 2; i < a.length; i++ ) {
                double diff =3D a[i] - b[i];
                dist +=3D FastMath.pow( 10, j ) * diff * diff;

                --j;
                if ( j < 1 ) {
                    j =3D 1;
                }
            }
            return FastMath.sqrt(dist);
        }

    }

    public static void main(String[] args) {

        List<Sequence> points =3D new ArrayList<Sequence>();

        for ( char a =3D 'A'; a <=3D 'Z' ; a++ ) {
            for ( char b =3D 'A'; b <=3D 'Z'; b++ ) {
                points.add(new Sequence( new char[] { a, b } ) );
            }
        }

        KMeansPlusPlusClusterer<Sequence> kmeans =3D new
KMeansPlusPlusClusterer<Sequence>(50, 100, new SequenceDistance());
        List<? extends Cluster<Sequence>> cluster =3D kmeans.cluster(points=
);

        Set<Cluster<Sequence>> currentClusters =3D new
HashSet<Cluster<Sequence>>(cluster);

        while (currentClusters.size() > 10 ) {
          Cluster<Sequence> a =3D null;
          Cluster<Sequence> b =3D null;
          double minDistance =3D Double.MAX_VALUE;
          int i =3D 0;
          for (Cluster<Sequence> clusterA : currentClusters) {
              int j =3D 0;
              for (Cluster<Sequence> clusterB : currentClusters) {
                  if (j++ <=3D i) {
                      continue;
                  }
                  double distance =3D maxDistance(clusterA, clusterB,
kmeans.getDistanceMeasure());
                  if (distance < minDistance) {
                      a =3D clusterA;
                      b =3D clusterB;
                      minDistance =3D distance;
                  }
              }
              i++;
          }

          currentClusters.remove(a);
          currentClusters.remove(b);
          Cluster<Sequence> merge =3D new
HierarchicalCluster<Sequence>(minDistance, a, b);
          currentClusters.add(merge);
        }

        for ( Cluster<Sequence> c : currentClusters ) {
            System.out.println(c.getPoints());
            printLeafs(c);
            System.out.println("---");
        }
    }

    public static <T extends Clusterable> double maxDistance(Cluster<T> a,
Cluster<T> b, DistanceMeasure dm) {
        double maxDistance =3D Double.MIN_VALUE;
        for (final T pA : a.getPoints()) {
            for (final T pB : b.getPoints()) {
                double d =3D dm.compute(pA.getPoint(), pB.getPoint());
                if (d > maxDistance) {
                    maxDistance =3D d;
                }
            }
        }
        return maxDistance;
    }

    public static <T extends Clusterable> void printLeafs(Cluster<T>
cluster) {
        if ( cluster instanceof HierarchicalCluster) {
            HierarchicalCluster hc =3D (HierarchicalCluster<T>) cluster;
            //System.out.println("Cluster: distance=3D" + hc.getDistance() =
+
" " + cluster.getPoints());

            if (hc.getLeftChild() !=3D null) {
                printLeafs(hc.getLeftChild());
            }
            if (hc.getRightChild() !=3D null) {
                printLeafs(hc.getRightChild());
            }

        } else {
            System.out.println("ClusterLeaf: " +
cluster.getPoints());
        }

    }


On Tue, Nov 12, 2013 at 11:58 PM, email@thorstenschaefer.de <
email@thorstenschaefer.de> wrote:

> I saw Thomas=92 patch in https://issues.apache.org/jira/browse/MATH-959wh=
ich aims to add support for HAC to commons-math. However, I am just faced
> with a use case and wonder if/how this could be done either with existing
> methods or the proposed HAC algorithm there.
>
> Lets assume we have items to 1000 cluster. Each item represents a
> sequence, e.g. AB, AC, AD, =85, BA, BB, BC, =85, ZA, =85, ZZ and I can as=
sign
> data points  to each item which can be used to calculate their
> similarity/distance. My goal is to create 50 clusters containing all
> sequences =96 this can be done pretty straight forward using KMeans++.
> However, lets assume we want a hierarchical cluster,  with 10 clusters at
> level 1 and 50 at level 2. At level one, I have the restriction that the
> first element in the sequence needs to be assigned to a unique cluster,
> e.g., the structure should look something like this:
> Cluster1: A, B, C
> Cluster1.1: AA, AC, AD, AE, =85, BD, BE, BF, =85 CA
> Cluster1.2: AB,BA,BC,BD,CB,CC,CD, =85
> =85
> Cluster1.7: AY,AZ,BZ,CZ
> // cluster 1 has 7 subclusters.
> Cluster2: D, E, F
> =85
> Cluster3: G
> Cluster3.1: GA,GB =85, GU
> Cluster3.2: GV, GW,=85 GZ
> // note that cluster 3 has only 2 sub clusters
> Cluster4: H, I
> =85
> Cluster 10: W, X, Y, Z
> // all sub clusters from cluster1 to cluster10 should add up to 50
>
> Hence, all sequences in a cluster in level 2 need to have its sequence
> prefix in the parent cluster. Furthermore, even though I want 10 clusters
> on level 1 and 50 on level 2, it does not mean that each level 1 cluster
> should necessarily have 5 child clusters.
>
> I hope its clear enough to get the general restriction I want to ensure
> and I wonder how this could be implemented using the clustering algorithm=
s
> in commons-math.
>
> Cheers,
>
> Thorsten
>

--047d7bdc7b984d10da04eb0f41f0--