From: WangRamon
Reply-To: user@mahout.apache.org
Subject: What will be a better value for T1 and T2 of a CosineDistanceMeasure
Date: Thu, 15 Mar 2012 13:25:38 +0800

Hi All,

I'm tuning the number of clusters for some news input using CosineDistanceMeasure. The input data is about 11,000 rows. I tried different settings for t1 and t2; here are the results:

1) With t1 = 0.6 and t2 = 0.9, I got Reduce output records=60
2) With t1 = 0.6 and t2 = 0.8, I got Reduce output records=868
3) With t1 = 0.6 and t2 = 0.7, I got Reduce output records=3374

I expected the reduce output (the number of clusters) to be less than 100, and the first setting matches that. What surprised me is how the results change with t2. My understanding is that cos(25°) is about 0.9 and cos(35°) is about 0.8 (cos(90°) = 0.0), so if I set t2 to cos(35°), it should generate fewer clusters than setting t2 to cos(25°), because it allows two vectors to be more different (the angle between them is larger). Did I miss something?

Thanks in advance.

Cheers
Ramon
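
One way to check the reasoning above is to convert each t2 value to an angle under two readings: t2 as a cosine similarity (the reading used in the message) and t2 as a cosine distance. The sketch below assumes the distance is defined as 1 - cosine similarity, which is how Mahout's CosineDistanceMeasure is generally described. The helper names are hypothetical and are not part of Mahout.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, ASSUMED here to be 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def angle_if_similarity(t):
    """Angle in degrees if the threshold t is read as cos(angle)."""
    return math.degrees(math.acos(t))


def angle_if_distance(t):
    """Angle in degrees if the threshold t is read as 1 - cos(angle)."""
    return math.degrees(math.acos(1.0 - t))


if __name__ == "__main__":
    # The three t2 values tried in the message.
    for t2 in (0.9, 0.8, 0.7):
        print(
            f"t2={t2}: "
            f"as similarity -> {angle_if_similarity(t2):.1f} deg, "
            f"as distance -> {angle_if_distance(t2):.1f} deg"
        )
```

If the thresholds are compared against the distance, a larger t2 corresponds to a larger angle (roughly 84° for 0.9 versus 73° for 0.7), so each cluster can absorb more vectors. That would be consistent with the cluster counts going up as t2 goes down, although this sketch does not reproduce the clustering job itself.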