Mailing-List: contact mahout-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mahout-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of bimargulies@gmail.com
 designates 209.85.219.226 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :content-type:content-transfer-encoding;
        b=KLwqz2IAVTaX6Md0o7h6foFli65Pzfz2B0hLbDjvNOznYXkKGFBstArvPsa9yA4hwx
         IkkPLkVqXkDpSZuREkaUP1zaIGuqL7bL9bfsjP5pXOIgp/tde/i8EIfnEAG1dILFjJKH
         G3XqTokkKHA7thjjlz2TK2E7CXoemPCspXnDM=
MIME-Version: 1.0
In-Reply-To: <c7d45fc70907271842i6c4fc8e7x84a6ab868464256b@mail.gmail.com>
References: <00FCCD05-104D-41BD-8BEA-3E2120BE8189@apache.org>
	 <c7d45fc70906162351p485acf76pd2497497393add9d@mail.gmail.com>
	 <31902868-296F-4087-89FD-5CAECBD6D394@apache.org>
	 <c7d45fc70907271842i6c4fc8e7x84a6ab868464256b@mail.gmail.com>
Date: Mon, 27 Jul 2009 21:51:49 -0400
Message-ID: <61b5d9410907271851u71013aer5be7a6da2d71b60@mail.gmail.com>
Subject: Re: Validating clustering output
From: Benson Margulies <bimargulies@gmail.com>
To: mahout-user@lucene.apache.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Brown and DiPietro's algorithm for clustering based on entropy is
somewhat infamous for the difficulty of achieving usable performance.
Mike Collins was responsible for a famously speedy version. Having
build one that is just barely fast enough in C++, I wouldn't recommend
trying it in Java. Of course, you aren't proposing that, just
recommending the bigram entropy metric or something like it.

On Mon, Jul 27, 2009 at 9:42 PM, Ted Dunning<ted.dunning@gmail.com> wrote:
> (vastly delayed response ... huge distractions competing with more than 2
> minutes answers are to blame)
>
> Grant,
>
> For evaluating clustering for symbol sequences:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=3D10.1.1.56.7275
>
> Most of the other references I have found talk about quality relative to
> gold standard judgments about whether exemplars are in the same class or
> relative to similarity/distinctiveness ratios. =C2=A0Neither is all that
> satisfactory.
>
> My preference is an entropic measure that describes how much of the
> information in your data is captured by the clustering vs how much residu=
al
> info there is.
>
> The other reference I am looking for may be in David Mackay's book. =C2=
=A0The
> idea is that you measure the quality of the approximation by looking at t=
he
> entropy in the cluster assignment relative to the residual required to
> precisely specify the original data relative to the quantized value.
>
> This is also related to trading off signal/noise in a vector quantizer.
>
> David, =C2=A0do you have a moment to talk about this with me? =C2=A0I can=
't free up
> the time to chase these final references and come up with a nice formula =
for
> this. =C2=A0I think you could do it in 10-20 minutes.
>
> On Tue, Jul 14, 2009 at 6:41 AM, Grant Ingersoll <gsingers@apache.org>wro=
te:
>
>> On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:
>>
>> =C2=A0A principled approach to cluster evaluation is to measure how well=
 the
>>> cluster membership captures the structure of unseen data. =C2=A0A natur=
al
>>> measure
>>> for this is to measure how much of the entropy of the data is captured =
by
>>> cluster membership. =C2=A0For k-means and its natural L_2 metric, the n=
atural
>>> cluster quality metric is the squared distance from the nearest centroi=
d
>>> adjusted by the log_2 of the number of clusters. =C2=A0This can be comp=
ared to
>>> the squared magnitude of the original data or the squared deviation fro=
m
>>> the
>>> centroid for all of the data. =C2=A0The idea is that you are changing t=
he
>>> representation of the data by allocating some of the bits in your origi=
nal
>>> representation to represent which cluster each point is in. =C2=A0If th=
ose bits
>>> aren't made up by the residue being small then your clustering is makin=
g a
>>> bad trade-off.
>>>
>>> In the past, I have used other more heuristic measures as well. =C2=A0O=
ne of
>>> the
>>> key characteristics that I would like to see out of a clustering is a
>>> degree
>>> of stability. =C2=A0Thus, I look at the fractions of points that are as=
signed
>>> to
>>> each cluster or the distribution of distances from the cluster centroid=
.
>>> These values should be relatively stable when applied to held-out data.
>>>
>>> For text, you can actually compute perplexity which measures how well
>>> cluster membership predicts what words are used. =C2=A0This is nice bec=
ause you
>>> don't have to worry about the entropy of real valued numbers.
>>>
>>
>> Do you have any references on any of the above approaches?
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
>