Subject: RE: T1 and T2 in Canopy
From: Szymon Chojnacki
To: user@mahout.apache.org
Date: Tue, 01 Mar 2011 11:21:51 +0100

Thank you Jeff for your advice,

I think that the problems I encounter are characteristic of the structure of our dataset. The cardinality of the vectors is 20K, whereas the average number of non-zero coordinates is ~50. I checked on a sample that, on average, 12% of the distances between the vectors are maximal (i.e. there is no overlap in the non-zero coordinates). Moreover, the same values of T1 and T2 are used in the mappers and in the reducer, which poses another challenge, as the distances among the centroids transferred to the reducer probably have a different distribution than the distances between the original vectors.

The process blows up either at the very beginning (too many centroids are created in the mappers) or after the mappers transfer the centroids to the reducer (as far as I can see, only one reducer is hard-coded, so everything has to be processed by one node).

Cheers
Szymon

On 28 February 2011 at 22:25, Jeff Eastman wrote:
> Canopy can be difficult to control and it appears you may have found a use case for not enforcing T1>T2 (we don't). It is curious, though, that the settings you have chosen assign points to canopies (dist […] dist>T1) in the centroids. What happens if you set T1=T2+epsilon; T2=1.9? That would at least follow the rules and give you the same number of clusters, but it would also add the centers of the outliers (dist>1.15). Is this where your processing time blows up?
>
> -----Original Message-----
> From: Szymon Chojnacki [mailto:sajmmon@o2.pl]
> Sent: Monday, February 28, 2011 11:55 AM
> To: user@mahout.apache.org
> Subject: T1 and T2 in Canopy
>
> Hello,
>
> I am working with my colleague Tim on the MAHOUT-588 project (https://issues.apache.org/jira/browse/MAHOUT-588). The goal of the project is to compare Mahout's clustering algorithms on the Apache-Mail-Archives dataset (6 million emails). I have spent the last few days trying to find values of T1 and T2 that give a non-trivial set of clusters (>1 and < # of all vectors) and produce the result within, e.g., 3h.
>
> I would be grateful for your advice, as the only way I could do it was by breaking the rule from the wiki that T1>T2. The problem is that if T1 is large, then we get many non-empty coordinates in each canopy, and both memory and CPU demand grow. However, setting a low T1 forces a low T2, which leads to a large number of canopies, and the same problem with memory and CPU.
>
> My understanding of the source code is that T1 and T2 are independent. So I set T1=1.15 and T2=1.9. This setting let me obtain ~200 canopies after 40 mins.
>
> Thank you in advance for your suggestions on setting T1 and T2, and on the importance of the T1>T2 constraint.
>
> Kind regards
> Szymon
>
> ps.
> I described my struggle in detail in https://issues.apache.org/jira/secure/attachment/12472217/mahout-588_canopy.pdf.
>
>

-- 
Szymon Chojnacki
http://www.ipipan.eu/~sch/
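
To make the T1/T2 trade-off discussed in this thread concrete, here is a minimal single-pass canopy sketch in Python. It is not Mahout's implementation: the function names (`cosine_distance`, `build_canopies`), the choice of cosine distance, the toy vectors, and the threshold values are all assumptions made for illustration, and canopy centers are kept fixed at the first point rather than recomputed as centroids. It shows that two sparse vectors with no shared non-zero coordinates sit at the maximal distance, and that in this sketch T2 alone determines how many canopies are created while T1 only controls membership.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for sparse vectors stored as {index: value} dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def build_canopies(points, t1, t2, dist=cosine_distance):
    """Single pass over the points (illustrative, not Mahout's code).

    A point joins every canopy whose center is closer than t1.
    A point closer than t2 to some center is 'strongly bound' and
    does not start a new canopy. t1 and t2 are checked independently.
    """
    canopies = []  # list of (center, members)
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = dist(center, p)
            if d < t1:
                members.append(p)
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append((p, [p]))
    return canopies


# Toy sparse vectors: two near-duplicate pairs and one isolated point.
points = [
    {0: 1.0, 1: 1.0},
    {0: 1.0, 1: 0.9},
    {5: 1.0},
    {5: 1.0, 6: 0.1},
    {9: 2.0},
]

# Disjoint non-zero coordinates give the maximal distance for non-negative vectors.
max_dist = cosine_distance(points[0], points[2])

# The number of canopies for a moderate, a very small, and a very large T2.
n_moderate = len(build_canopies(points, t1=0.5, t2=0.2))
n_tiny_t2 = len(build_canopies(points, t1=0.5, t2=0.001))
n_huge_t2 = len(build_canopies(points, t1=0.5, t2=1.5))
print(max_dist, n_moderate, n_tiny_t2, n_huge_t2)
```

With a very small T2 every point starts its own canopy, and with a T2 above the maximal distance everything collapses into a single canopy. When a large share of pairwise distances is exactly maximal, the range of T2 values that gives a non-trivial result can be narrow.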