Date: Fri, 28 Oct 2011 14:29:58 +0900
Subject: Re: Dirichlet Process Clustering not working
From: edward choi
To: user@mahout.apache.org

Okay, I have tested with the Reuters set, and the result was much better than
with my own news documents.
I downloaded the Reuters set and converted it into a sequence file.
Then I turned it into sparse vectors with the following arguments:
--minDF 2 --maxDFPercent 50 --weight TFIDF --norm 2 -ng 2 -nv

Then I ran DPC with the same arguments you gave me.
The total number of documents was 21578.
DC-0 had 11187 documents. Seven clusters had zero documents. The rest of the
clusters had from 1 to 1189 documents each.

A very interesting thing is that DC-16, 18, and 19 behave just as they did when
I ran DPC on my own document set: all three are empty and have exactly the same
negative values in their points.
----------------------------------------------------------------------------
DC-16 total= 0 model= DMC:16{n=0 c=[0:-0.411, 0.003:-0.061, 0.01:1.685,
0.02:-0.560, 0.025:-0.147, 0.03:-0.675, 0.04:-0.234, 0.05:-0.430,
0.06:0.451, 0.07:0.186, 0.073:-0.799, 0.077:0.724, 0.1:0.731, 0.10:2.274,
0.11:-0.739, 0.12:0.660, 0.127:1.546, 0.13:0.907, 0.139:0.839, 0.14:-0.060,
0.15:0.006, 0.16:0.294, 0.163:-0.458, 0.17:0.057, 0.18:0.173, 0.185:0.938,
0.19:-1.340, 0.194:-0.597, 0.2:0.311, 0.20:-0.318, 0.206:-0.053,
0.21:-0.198, 0.2125:-1.851, 0.22:-0.604,................
Top Terms:
jersey based => 5.055564881106928
withdrew offer => 4.160793145890344
although said => 4.1069074456260966
confirmed iraqi => 4.016748531705415
force administration => 3.995899196620034
24.6 => 3.9719147317695596
due mostly => 3.9125799367453267
unit british => 3.9048586110602286
trade source => 3.892495010521945
stevens => 3.7816279439782554
DC-18 total= 0 model= DMC:18{n=0 c=[0:-0.411, 0.003:-0.061, 0.01:1.685,
0.02:-0.560, 0.025:-0.147, 0.03:-0.675, 0.04:-0.234, 0.05:-0.430,
0.06:0.451, 0.07:0.186, 0.073:-0.799, 0.077:0.724, 0.1:0.731, 0.10:2.274,
0.11:-0.739, 0.12:0.660, 0.127:1.546, 0.13:0.907, 0.139:0.839, 0.14:-0.060,
0.15:0.006, 0.16:0.294, 0.163:-0.458, 0.17:0.057, 0.18:0.173, 0.185:0.938,
0.19:-1.340, 0.194:-0.597, 0.2:0.311, 0.20:-0.318, 0.206:-0.053,
0.21:-0.198, 0.2125:-1.851, 0.22:-0.604,..............
Top Terms:
jersey based => 5.055564881106928
withdrew offer => 4.160793145890344
although said => 4.1069074456260966
confirmed iraqi => 4.016748531705415
force administration => 3.995899196620034
24.6 => 3.9719147317695596
due mostly => 3.9125799367453267
unit british => 3.9048586110602286
trade source => 3.892495010521945
stevens => 3.7816279439782554
DC-19 total= 0 model= DMC:19{n=0 c=[0:-0.411, 0.003:-0.061, 0.01:1.685,
0.02:-0.560, 0.025:-0.147, 0.03:-0.675, 0.04:-0.234, 0.05:-0.430,
0.06:0.451, 0.07:0.186, 0.073:-0.799, 0.077:0.724, 0.1:0.731, 0.10:2.274,
0.11:-0.739, 0.12:0.660, 0.127:1.546, 0.13:0.907, 0.139:0.839, 0.14:-0.060,
0.15:0.006, 0.16:0.294, 0.163:-0.458, 0.17:0.057, 0.18:0.173, 0.185:0.938,
0.19:-1.340, 0.194:-0.597, 0.2:0.311, 0.20:-0.318, 0.206:-0.053,
0.21:-0.198, 0.2125:-1.851, 0.22:-0.604,...........
Top Terms:
jersey based => 5.055564881106928
withdrew offer => 4.160793145890344
although said => 4.1069074456260966
confirmed iraqi => 4.016748531705415
force administration => 3.995899196620034
24.6 => 3.9719147317695596
due mostly => 3.9125799367453267
unit british => 3.9048586110602286
trade source => 3.892495010521945
stevens => 3.7816279439782554
----------------------------------------------------------------------------
So I'm guessing there is some kind of algorithmic problem, since the test
sets were different but the same clusters (DC-16, 18, and 19) again share
identical values?

Regards,
Ed

2011/10/28 edward choi

> I downloaded the most recent version of Mahout from Apache SVN.
> Using the new arguments, I have tested DPC on my own news documents (not
> the Reuters set).
>
> It turns out there really were great improvements. First of all, documents
> are somewhat distributed across the 20 clusters.
> The total number of documents was 5896.
> DC-0 had 1014 documents. DC-1 had 4305 documents.
> Nine clusters had zero documents. The rest of the clusters had from 1 to
> 214 documents each.
>
> The quality of the clusters wasn't great, but I guess that has to do with
> the crude preprocessing step (raw news documents have links, ads,
> reader comments, etc.).
> I will know better when I test with build-reuters.sh.
>
> One more thing. Unfortunately, there are still some negative values in the
> cluster points.
>
> --------------------------------------------------------------------------
> DC-16 total= 0 model= DMC:16{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
> 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
> 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
> Top Terms:
> kodak camera => 4.5009259007672835
> player july => 4.216287519075373
> figure mix => 4.139826527167421
> department defense => 4.009974576583582
> remark wednesday => 3.9945681051149564
> counsel infection => 3.886000915158471
> jefferson county => 3.8442975919513667
> jersey say => 3.7821696224124786
> tell couple => 3.7644857721992415
> 3.5 million => 3.743525174300145
> DC-18 total= 0 model= DMC:18{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
> 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
> 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
> Top Terms:
> kodak camera => 4.5009259007672835
> player july => 4.216287519075373
> figure mix => 4.139826527167421
> department defense => 4.009974576583582
> remark wednesday => 3.9945681051149564
> counsel infection => 3.886000915158471
> jefferson county => 3.8442975919513667
> jersey say => 3.7821696224124786
> tell couple => 3.7644857721992415
> 3.5 million => 3.743525174300145
> DC-19 total= 0 model= DMC:19{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
> 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
> 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
> Top Terms:
> kodak camera => 4.5009259007672835
> player july => 4.216287519075373
> figure mix => 4.139826527167421
> department defense => 4.009974576583582
> remark wednesday => 3.9945681051149564
> counsel infection => 3.886000915158471
> jefferson county => 3.8442975919513667
> jersey say => 3.7821696224124786
> tell couple => 3.7644857721992415
> 3.5 million => 3.743525174300145
> --------------------------------------------------------------------------
> Among the nine clusters that have zero members, the three above have
> negative values.
> Interestingly, all three of them have exactly the same values and top terms.
> I wonder what this means.
>
> Anyway, I'll post another thread when I have played around with the
> Reuters set.
>
> Ed
>
> ps. The runtime has indeed been reduced significantly!!! Possibly 100 times
> faster, as you said. Loved it!!
>
> 2011/10/20 Jeff Eastman
>
>> R1186452 commits two small changes that seem to do much better with
>> Reuters than before:
>> - fixed DistanceMeasureClusterDistribution to generate Gaussian element
>> values in the prior clusters. Zero values in previous implementation don't
>> work with CosineDistanceMeasure.
>> - changed Dirichlet arguments to use DMCD and CosineDM in build-reuters.sh
>> - switched -mp to DenseVector since all the prior center elements are
>> Gaussian and generally non-zero
>> - increased -a0 to 2
>>
>> Build-reuters now does a much better job with the wide topic vectors using
>> the DMCD/CosineDM. And it runs maybe 100x faster too.
>> Here are the new arguments:
>>
>> $MAHOUT dirichlet \
>>   -i ${WORK_DIR}/reuters-out-seqdir-sparse-dirichlet/tfidf-vectors \
>>   -o ${WORK_DIR}/reuters-dirichlet -k 20 -ow -x 10 -a0 2 \
>>   -md org.apache.mahout.clustering.dirichlet.models.DistanceMeasureClusterDistribution \
>>   -mp org.apache.mahout.math.DenseVector \
>>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>
>>
>> -----Original Message-----
>> From: Jeff Eastman [mailto:jeastman@Narus.com]
>> Sent: Wednesday, October 19, 2011 9:53 AM
>> To: user@mahout.apache.org
>> Subject: RE: Dirichlet Process Clustering not working
>>
>> The pdf() implementation in GaussianCluster is pretty lame. It is
>> computing a running product of the element pdfs which, for wide input
>> vectors (Reuters is 41,807), always underflows and returns 0. Here's the
>> code:
>>
>> public double pdf(VectorWritable vw) {
>>   Vector x = vw.get();
>>   // return the product of the component pdfs
>>   // TODO: is this reasonable? correct? It seems to work in some cases.
>>   double pdf = 1;
>>   for (int i = 0; i < x.size(); i++) {
>>     // small prior on stdDev to avoid numeric instability when stdDev==0
>>     pdf *= UncommonDistributions.dNorm(x.getQuick(i),
>>         getCenter().getQuick(i), getRadius().getQuick(i) + 0.000001);
>>   }
>>   return pdf;
>> }
>>
>> -----Original Message-----
>> From: Jeff Eastman [mailto:jeastman@Narus.com]
>> Sent: Wednesday, October 19, 2011 9:04 AM
>> To: user@mahout.apache.org
>> Subject: RE: Dirichlet Process Clustering not working
>>
>> I agree something is amiss here, but it could be the model is just not
>> suitable for this problem. Running with the Reuters dataset, I see all the
>> points being assigned to C-0 in the very first iteration as you do. I think
>> the problem is with the pdf() calculations in the mapper for very wide
>> vectors such as we are using. For smaller dimension vectors, DPC appears to
>> be working great.
>>
>> I'm going to commit the build-reuters.sh enhancements I've added for
>> FuzzyK and DPC so we can both use the same platform. I will report more
>> progress as I dig in deeper today...
>>
>> -----Original Message-----
>> From: edward choi [mailto:mp2893@gmail.com]
>> Sent: Wednesday, October 19, 2011 8:11 AM
>> To: user@mahout.apache.org
>> Subject: Re: Dirichlet Process Clustering not working
>>
>> Okay, I've just tried DPC with the Reuters document set.
>> I let 'build-reuters.sh' create the sequence files and vectors. (From
>> the look of the dictionary generated by Mahout, the number of features
>> seemed to be less than 100,000.)
>> Then I used them to run DPC (15 clusters, 10 iterations, 1.0 alpha,
>> clustering true, no additional options).
>> Below is the result of the clusterdump of clusters-10:
>>
>> -------------------------------------------------------------------------
>> C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001, 0.01:0.004, 0.02:0.002,
>> 0.05:0.004, 0.07:0.005, 0.07
>> Top Terms:
>> said => 1.6577128281476725
>> mln => 1.2455441154347937
>> dlrs => 1.1173752482257673
>> 3 => 1.042824193090437
>> pct => 1.0223684722334667
>> reuter => 0.9934255143959358
>> C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228, 0.01:-0.401, 0.02:-0.711,
>> 0.05:1.840, 0.07:0.136, 0.077:-0.739, 0.1:-0.177, 0.10:
>> Top Terms:....
>> C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426, 0.01:-0.472, 0.02:0.672,
>> 0.05:0.800, 0.07:0.691, 0.077:1.037, 0.1:0
>> Top Terms:....
>> C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748, 0.01:-1.030, 0.02:-1.760,
>> 0.05:-0.343, 0.07:0.286, 0.077:1.179,
>> Top Terms:....
>> -------------------------------------------------------------------------
>> I guess the same thing happened again. So the document set is not the
>> problem. Something is definitely wrong with DPC.
>> An interesting thing is that the first cluster point does not have a single
>> negative value in it.
>> The rest of the cluster points have a lot of negative values. So I guess
>> this phenomenon has something to do with the first cluster hogging all the
>> documents.
>> Any comments on this result?
>> (I haven't tried TestClusterDumper.testDirichlet2&3 yet. I'll post another
>> thread when I am done with that.)
>>
>> Regards,
>> Ed
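
The underflow described for pdf() in the thread is easy to reproduce outside Mahout. The sketch below is not Mahout code: LogPdfSketch, productPdf, and logPdf are hypothetical names, the local dNorm is only a stand-in for UncommonDistributions.dNorm, and plain double[] arrays replace Mahout vectors. It compares the running product with a sum of log densities, one common way to sidestep the underflow, on a vector as wide as the Reuters one (41,807 elements). Whether the rest of DPC could consume log values directly is an assumption, not something the thread confirms.

```java
import java.util.Arrays;

/** Sketch only: running product of normal densities vs. a sum of log densities. */
final class LogPdfSketch {
    private static final double LOG_SQRT_2PI = 0.5 * Math.log(2.0 * Math.PI);

    /** Density of a normal distribution N(mean, sd^2) at x (stand-in, not the Mahout helper). */
    static double dNorm(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    /** Running product of per-element densities, mirroring the shape of the quoted pdf(). */
    static double productPdf(double[] x, double[] center, double[] radius) {
        double pdf = 1.0;
        for (int i = 0; i < x.length; i++) {
            // same small prior on the standard deviation as in the quoted code
            pdf *= dNorm(x[i], center[i], radius[i] + 0.000001);
        }
        return pdf;
    }

    /** Sum of per-element log densities: the logarithm of the product, computed without underflow. */
    static double logPdf(double[] x, double[] center, double[] radius) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double sd = radius[i] + 0.000001;
            double z = (x[i] - center[i]) / sd;
            sum += -0.5 * z * z - Math.log(sd) - LOG_SQRT_2PI;
        }
        return sum;
    }

    public static void main(String[] args) {
        int width = 41807; // vector width quoted for Reuters in the thread
        double[] x = new double[width];      // all-zero point
        double[] center = new double[width]; // all-zero center
        double[] radius = new double[width];
        Arrays.fill(radius, 1.0);            // unit radius in every dimension

        System.out.println("product pdf = " + productPdf(x, center, radius));
        System.out.println("log pdf     = " + logPdf(x, center, radius));
    }
}
```

With these inputs the product collapses to 0.0 even though the point sits exactly on the center, while the log sum stays finite, so scores for different clusters can still be compared with each other.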