From: Sean Owen
To: Mahout User List
Date: Wed, 27 Feb 2013 16:05:13 +0000
Subject: Re: Vector distance within a cluster

A common measure of cluster coherence is the mean distance, or mean squared distance, between the members and the cluster centroid. It sounds like this is the kind of thing you're measuring with these all-pairs distances. That could be a measure too; I've usually seen that done by taking the maximum such intracluster distance, the 'diameter'.

To answer Ted's question -- you're measuring internal consistency. You're not trying to find clusters that match some external standard that says these 100 docs should cluster together, etc.

I'm speaking off the cuff, but I think the idea was that L1/Manhattan distance may give you clusters that tend to spread out over fewer rather than more dimensions, and so that may make them more interpretable -- because they will tend to be nearly identical in the remaining dimensions, and those homogeneous dimensions tell you what they're "about".

The reason is that L1 is "indifferent" across dimensions -- moving a unit in any dimension makes you a unit further from or closer to another point -- while in L2, moving along a dimension where you are already close does little.

On Wed, Feb 27, 2013 at 3:23 PM, Chris Harrington wrote:

> Hmmm, you may have to dumb things down for me here. I don't have much
> of a background in the area of ML and I'm just piecing things together and
> learning as I go.
> So I don't really understand what you mean by "Coherence against an
> external standard?
> Or internal consistency/homogeneity?" or "One thought
> along these lines is to add L_1 regularization to the k-means algorithm."
> Is L_1 regularization the same as Manhattan distance?
>
> That aside, I'm outputting a file with the top terms and the text of 20
> random documents that ended up in that cluster and eyeballing that. It's not
> very high-tech or efficient, but it was the only way I knew to make a
> relevance judgment on a cluster topic. For example, if the majority of the
> samples are sport related and 82.6% of the vector distances in my cluster
> are quite similar, I'm happy to call that cluster sport.
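
For reference, the coherence measures mentioned in the reply (mean distance to the centroid, mean squared distance to the centroid, and the 'diameter') could be sketched roughly as follows. This is only an illustration in plain Python, not Mahout code: the function names `centroid`, `euclidean`, `manhattan` and `coherence` are made up for this example, and a cluster is assumed to be a small list of equal-length numeric vectors.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def euclidean(a, b):
    """L2 distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """L1 distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def coherence(points, distance=euclidean):
    """Internal-consistency measures for one cluster (hypothetical helper).

    Returns the mean distance to the centroid, the mean squared distance
    to the centroid, and the diameter (largest pairwise distance).
    """
    c = centroid(points)
    to_centroid = [distance(p, c) for p in points]
    n = len(points)
    # All-pairs scan is O(n^2); fine for a sketch, costly for big clusters.
    diameter = max(
        (distance(a, b) for a, b in combinations(points, 2)),
        default=0.0,
    )
    return {
        "mean_distance": sum(to_centroid) / n,
        "mean_squared_distance": sum(d * d for d in to_centroid) / n,
        "diameter": diameter,
    }


if __name__ == "__main__":
    cluster = [(0, 0), (2, 0), (0, 2), (2, 2)]
    print(coherence(cluster))
    print(coherence(cluster, distance=manhattan))

    # L1 vs L2: nudge a point one unit along a dimension in which the two
    # points already agree, and compare how much each distance changes.
    a, b, b_moved = (0, 0), (10, 0), (10, 1)
    print(manhattan(a, b_moved) - manhattan(a, b))
    print(euclidean(a, b_moved) - euclidean(a, b))
```

The last two prints are meant to show the "indifference" point: the one-unit nudge changes the L1 distance by exactly one unit, while the L2 distance barely moves.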