Message-ID: <4FC68281.9040205@windwardsolutions.com>
Date: Wed, 30 May 2012 16:26:41 -0400
From: Jeff Eastman
To: user@mahout.apache.org
Subject: Re: Clustering a large crawl

The CosineDistanceMeasure returns 1 - dotProduct / denominator so it is returning the value you note. If the documents are very similar, then their distance will be small and t=0.1 could be too large to distinguish anything but the gross differences between the documents in the corpus. I'd try dropping the t-value until I get at least 50-100 clusters but I have no idea how small that might be.

On 5/30/12 4:11 PM, Robert Stewart wrote:
> That is a good point. t1/t2 are distance measures but cosine is a similarity measure, so you need to think of it as 1-cosine.
>
> On May 30, 2012, at 4:03 PM, Jeff Eastman wrote:
>
>> Have you tried much smaller values for t1=t2? Recall that the t-values specify the distance within which a new point is assigned to an existing canopy. In the limit as t -> 0, you should get n clusters, where n is the number of documents in your corpus.
>>
>> On 5/30/12 1:23 PM, Pat Ferrel wrote:
>>> I have about 150,000 docs on which I ran canopy with values for t1 = t2 from 0.1 to 0.95 using the Cosine distance measure. I got results that range from 1.5 docs per cluster to 3. In other words canopy produced a very large number of centroids, which does not seem to represent the data very well. Trying random values for k seems to produce better results but still spotty and hard to judge. I am at the point of giving up on canopy and so wrote a utility to simply iterate k over some values and run the evaluators each time, but there are currently some problems with CDbw (Inter-Cluster Density is always 0.0 for instance).
>>>
>>> This seems like such a fundamental problem that others must have found a way to get better results. Any suggestions?
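
For anyone following the thread, here is a rough Python sketch of the two ideas discussed above: the 1 - dotProduct / denominator distance, and how the canopy count responds to the t-value when t1 = t2. It is not Mahout's code. The function names, the toy vectors, and the t-values swept are all made up for illustration, and the canopy pass is simplified in that each canopy keeps its first point as a fixed center instead of tracking a centroid.

```python
import math


def cosine_distance(a, b):
    """Return 1 - dot(a, b) / (|a| * |b|), the formula quoted in the thread."""
    dot = sum(x * y for x, y in zip(a, b))
    denominator = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / denominator


def canopy_centers(points, t):
    """Simplified single-pass canopy with t1 = t2 = t (an assumption of this sketch).

    A point closer than t to an existing center is absorbed by that canopy;
    otherwise it becomes the center of a new canopy.
    """
    centers = []
    for point in points:
        if not any(cosine_distance(point, center) < t for center in centers):
            centers.append(point)
    return centers


# Hypothetical term-weight vectors: two pairs of near-duplicates plus one outlier.
docs = [
    (1.0, 0.0, 0.0),
    (0.9, 0.1, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.9, 0.1),
    (0.0, 0.0, 1.0),
]

# Sweep t downward and watch the number of canopies grow toward len(docs).
for t in (0.5, 0.1, 0.01, 0.001):
    print(t, len(canopy_centers(docs, t)))
```

On this toy data the near-duplicate pairs are only about 0.006 apart, so they stay merged until t drops below that, at which point every document becomes its own canopy. That is the t -> 0 limit described above.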