Return-Path: Delivered-To: apmail-lucene-mahout-user-archive@minotaur.apache.org Received: (qmail 62329 invoked from network); 28 Jul 2009 01:51:19 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 28 Jul 2009 01:51:19 -0000 Received: (qmail 23357 invoked by uid 500); 28 Jul 2009 01:52:23 -0000 Delivered-To: apmail-lucene-mahout-user-archive@lucene.apache.org Received: (qmail 23295 invoked by uid 500); 28 Jul 2009 01:52:23 -0000 Mailing-List: contact mahout-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mahout-user@lucene.apache.org Delivered-To: mailing list mahout-user@lucene.apache.org Received: (qmail 23285 invoked by uid 99); 28 Jul 2009 01:52:23 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 28 Jul 2009 01:52:23 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of bimargulies@gmail.com designates 209.85.219.226 as permitted sender) Received: from [209.85.219.226] (HELO mail-ew0-f226.google.com) (209.85.219.226) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 28 Jul 2009 01:52:10 +0000 Received: by ewy26 with SMTP id 26so3722196ewy.5 for ; Mon, 27 Jul 2009 18:51:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type :content-transfer-encoding; bh=WQoAKg4WXcWNKjKTXMM/mDUnxdM3GgMIGUNTEW+WRk0=; b=qsnoU564sVz+NrFGK3lZoe1II/Nw+4XztDkV4x43PLdfhB1Vm4eptAd8c68zkpTgaJ VbOTbbTrq3JX2LxwLbowt0pFG3BmY7aHCfGg9aZ2q8xUgvRXjgqVgZtMca0KFCRyr4zO ZtidDQ0HebW3ZQ3dMVw3X6YGMd9ik9002mszw= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; b=KLwqz2IAVTaX6Md0o7h6foFli65Pzfz2B0hLbDjvNOznYXkKGFBstArvPsa9yA4hwx IkkPLkVqXkDpSZuREkaUP1zaIGuqL7bL9bfsjP5pXOIgp/tde/i8EIfnEAG1dILFjJKH G3XqTokkKHA7thjjlz2TK2E7CXoemPCspXnDM= MIME-Version: 1.0 Received: by 10.216.71.82 with SMTP id q60mr1833010wed.169.1248745909384; Mon, 27 Jul 2009 18:51:49 -0700 (PDT) In-Reply-To: References: <00FCCD05-104D-41BD-8BEA-3E2120BE8189@apache.org> <31902868-296F-4087-89FD-5CAECBD6D394@apache.org> Date: Mon, 27 Jul 2009 21:51:49 -0400 Message-ID: <61b5d9410907271851u71013aer5be7a6da2d71b60@mail.gmail.com> Subject: Re: Validating clustering output From: Benson Margulies To: mahout-user@lucene.apache.org Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org Brown and DiPietro's algorithm for clustering based on entropy is somewhat infamous for the difficulty of achieving usable performance. Mike Collins was responsible for a famously speedy version. Having build one that is just barely fast enough in C++, I wouldn't recommend trying it in Java. Of course, you aren't proposing that, just recommending the bigram entropy metric or something like it. On Mon, Jul 27, 2009 at 9:42 PM, Ted Dunning wrote: > (vastly delayed response ... huge distractions competing with more than 2 > minutes answers are to blame) > > Grant, > > For evaluating clustering for symbol sequences: > > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=3D10.1.1.56.7275 > > Most of the other references I have found talk about quality relative to > gold standard judgments about whether exemplars are in the same class or > relative to similarity/distinctiveness ratios. =C2=A0Neither is all that > satisfactory. > > My preference is an entropic measure that describes how much of the > information in your data is captured by the clustering vs how much residu= al > info there is. > > The other reference I am looking for may be in David Mackay's book. =C2= =A0The > idea is that you measure the quality of the approximation by looking at t= he > entropy in the cluster assignment relative to the residual required to > precisely specify the original data relative to the quantized value. > > This is also related to trading off signal/noise in a vector quantizer. > > David, =C2=A0do you have a moment to talk about this with me? =C2=A0I can= 't free up > the time to chase these final references and come up with a nice formula = for > this. =C2=A0I think you could do it in 10-20 minutes. > > On Tue, Jul 14, 2009 at 6:41 AM, Grant Ingersoll wro= te: > >> On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote: >> >> =C2=A0A principled approach to cluster evaluation is to measure how well= the >>> cluster membership captures the structure of unseen data. =C2=A0A natur= al >>> measure >>> for this is to measure how much of the entropy of the data is captured = by >>> cluster membership. =C2=A0For k-means and its natural L_2 metric, the n= atural >>> cluster quality metric is the squared distance from the nearest centroi= d >>> adjusted by the log_2 of the number of clusters. =C2=A0This can be comp= ared to >>> the squared magnitude of the original data or the squared deviation fro= m >>> the >>> centroid for all of the data. =C2=A0The idea is that you are changing t= he >>> representation of the data by allocating some of the bits in your origi= nal >>> representation to represent which cluster each point is in. =C2=A0If th= ose bits >>> aren't made up by the residue being small then your clustering is makin= g a >>> bad trade-off. >>> >>> In the past, I have used other more heuristic measures as well. =C2=A0O= ne of >>> the >>> key characteristics that I would like to see out of a clustering is a >>> degree >>> of stability. =C2=A0Thus, I look at the fractions of points that are as= signed >>> to >>> each cluster or the distribution of distances from the cluster centroid= . >>> These values should be relatively stable when applied to held-out data. >>> >>> For text, you can actually compute perplexity which measures how well >>> cluster membership predicts what words are used. =C2=A0This is nice bec= ause you >>> don't have to worry about the entropy of real valued numbers. >>> >> >> Do you have any references on any of the above approaches? >> > > > > -- > Ted Dunning, CTO > DeepDyve > > 111 West Evelyn Ave. Ste. 202 > Sunnyvale, CA 94086 > http://www.deepdyve.com > 858-414-0013 (m) > 408-773-0220 (fax) >