mahout-user mailing list archives

From edward choi <mp2...@gmail.com>
Subject Re: Dirichlet Process Clustering not working
Date Fri, 28 Oct 2011 05:29:58 GMT
Okay, I have tested with the Reuters set, and the result was much better than
with my own news documents.

I downloaded the Reuters set and converted it into sequence files. Then I
turned those into sparse vectors with the following arguments:
--minDF 2 --maxDFPercent 50 --weight TFIDF --norm 2 -ng 2 -nv
Then I ran DPC with the same arguments you gave me.

The total number of documents was 21578.
DC-0 had 11187 documents.
Seven clusters had zero docs.
The rest of the clusters had between 1 and 1189 docs each.

The very interesting thing is that DC-16, DC-18, and DC-19 again share exactly
the same negative-valued points, just as they did when I ran DPC on my own
document set.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
DC-16 total= 0 model= DMC:16{n=0 c=[0:-0.411, 0.003:-0.061, 0.01:1.685,
0.02:-0.560, 0.025:-0.147, 0.03:-0.675, 0.04:-0.234, 0.05:-0.430,
0.06:0.451, 0.07:0.186, 0.073:-0.799, 0.077:0.724, 0.1:0.731, 0.10:2.274,
0.11:-0.739, 0.12:0.660, 0.127:1.546, 0.13:0.907, 0.139:0.839, 0.14:-0.060,
0.15:0.006, 0.16:0.294, 0.163:-0.458, 0.17:0.057, 0.18:0.173, 0.185:0.938,
0.19:-1.340, 0.194:-0.597, 0.2:0.311, 0.20:-0.318, 0.206:-0.053,
0.21:-0.198, 0.2125:-1.851, 0.22:-0.604,................
    Top Terms:
jersey based                            =>   5.055564881106928
withdrew offer                          =>   4.160793145890344
although said                           =>  4.1069074456260966
confirmed iraqi                         =>   4.016748531705415
force administration                    =>   3.995899196620034
24.6                                    =>  3.9719147317695596
due mostly                              =>  3.9125799367453267
unit british                            =>  3.9048586110602286
trade source                            =>   3.892495010521945
stevens                                 =>  3.7816279439782554
DC-18 total= 0 model= DMC:18{n=0 c=[0:-0.411, 0.003:-0.061, 0.01:1.685,
0.02:-0.560, 0.025:-0.147, 0.03:-0.675, 0.04:-0.234, 0.05:-0.430,
0.06:0.451, 0.07:0.186, 0.073:-0.799, 0.077:0.724, 0.1:0.731, 0.10:2.274,
0.11:-0.739, 0.12:0.660, 0.127:1.546, 0.13:0.907, 0.139:0.839, 0.14:-0.060,
0.15:0.006, 0.16:0.294, 0.163:-0.458, 0.17:0.057, 0.18:0.173, 0.185:0.938,
0.19:-1.340, 0.194:-0.597, 0.2:0.311, 0.20:-0.318, 0.206:-0.053,
0.21:-0.198, 0.2125:-1.851, 0.22:-0.604,..............
    Top Terms:
jersey based                            =>   5.055564881106928
withdrew offer                          =>   4.160793145890344
although said                           =>  4.1069074456260966
confirmed iraqi                         =>   4.016748531705415
force administration                    =>   3.995899196620034
24.6                                    =>  3.9719147317695596
due mostly                              =>  3.9125799367453267
unit british                            =>  3.9048586110602286
trade source                            =>   3.892495010521945
stevens                                 =>  3.7816279439782554
DC-19 total= 0 model= DMC:19{n=0 c=[0:-0.411, 0.003:-0.061, 0.01:1.685,
0.02:-0.560, 0.025:-0.147, 0.03:-0.675, 0.04:-0.234, 0.05:-0.430,
0.06:0.451, 0.07:0.186, 0.073:-0.799, 0.077:0.724, 0.1:0.731, 0.10:2.274,
0.11:-0.739, 0.12:0.660, 0.127:1.546, 0.13:0.907, 0.139:0.839, 0.14:-0.060,
0.15:0.006, 0.16:0.294, 0.163:-0.458, 0.17:0.057, 0.18:0.173, 0.185:0.938,
0.19:-1.340, 0.194:-0.597, 0.2:0.311, 0.20:-0.318, 0.206:-0.053,
0.21:-0.198, 0.2125:-1.851, 0.22:-0.604,...........
    Top Terms:
jersey based                            =>   5.055564881106928
withdrew offer                          =>   4.160793145890344
although said                           =>  4.1069074456260966
confirmed iraqi                         =>   4.016748531705415
force administration                    =>   3.995899196620034
24.6                                    =>  3.9719147317695596
due mostly                              =>  3.9125799367453267
unit british                            =>  3.9048586110602286
trade source                            =>   3.892495010521945
stevens                                 =>  3.7816279439782554
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

So I'm guessing there is some kind of algorithmic problem, since the test
sets were different but the same three clusters (DC-16, 18, 19) again have
identical values?

Regards,
Ed

2011/10/28 edward choi <mp2893@gmail.com>

>
> I downloaded the most recent version of Mahout from apache SVN.
> Using the new arguments, I have tested DPC on my own news documents (not
> the Reuters set).
>
> It turns out there really were great improvements. First of all, the
> documents are now somewhat distributed across the 20 clusters.
> The total number of documents was 5896.
> DC-0 had 1014 documents. DC-1 had 4305 documents.
> Nine clusters had zero documents. The rest of the clusters had from 1 to 214
> documents each.
>
> The quality of the clusters wasn't great, but I guess that has to do with
> the crude preprocessing step (raw news documents have links, ads,
> reader comments, etc.).
> I will know better when I test with build-reuters.sh.
>
> One more thing: unfortunately, there are still some negative values in the
> cluster points.
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> DC-16 total= 0 model= DMC:16{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
> 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
> 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
>     Top Terms:
>         kodak camera                            =>  4.5009259007672835
>         player july                             =>   4.216287519075373
>         figure mix                              =>   4.139826527167421
>         department defense                      =>   4.009974576583582
>         remark wednesday                        =>  3.9945681051149564
>         counsel infection                       =>   3.886000915158471
>         jefferson county                        =>  3.8442975919513667
>         jersey say                              =>  3.7821696224124786
>         tell couple                             =>  3.7644857721992415
>         3.5 million                             =>   3.743525174300145
> DC-18 total= 0 model= DMC:18{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
> 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
> 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
>     Top Terms:
>         kodak camera                            =>  4.5009259007672835
>         player july                             =>   4.216287519075373
>         figure mix                              =>   4.139826527167421
>         department defense                      =>   4.009974576583582
>         remark wednesday                        =>  3.9945681051149564
>         counsel infection                       =>   3.886000915158471
>         jefferson county                        =>  3.8442975919513667
>         jersey say                              =>  3.7821696224124786
>         tell couple                             =>  3.7644857721992415
>         3.5 million                             =>   3.743525174300145
> DC-19 total= 0 model= DMC:19{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
> 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
> 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
>     Top Terms:
>         kodak camera                            =>  4.5009259007672835
>         player july                             =>   4.216287519075373
>         figure mix                              =>   4.139826527167421
>         department defense                      =>   4.009974576583582
>         remark wednesday                        =>  3.9945681051149564
>         counsel infection                       =>   3.886000915158471
>         jefferson county                        =>  3.8442975919513667
>         jersey say                              =>  3.7821696224124786
>         tell couple                             =>  3.7644857721992415
>         3.5 million                             =>   3.743525174300145
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Among the nine clusters that have zero members, the three above have negative
> values.
> Interestingly, all three of them have exactly the same values and top terms.
> I wonder what this means.
>
> Anyway, I'll post another thread when I have played around with the Reuters set.
>
> Ed
>
> ps. The runtime has indeed reduced significantly!!! Possibly 100 times
> faster as you said. Loved it!!
>
> 2011/10/20 Jeff Eastman <jeastman@narus.com>
>
>> R1186452 commits two small changes that seem to do much better with
>> Reuters than before:
>> - fixed DistanceMeasureClusterDistribution to generate Gaussian element
>> values in the prior clusters. Zero values in the previous implementation
>> don't work with CosineDistanceMeasure.
>> - changed Dirichlet arguments to use DMCD and CosineDM in build-reuters.sh
>> - switched -mp to DenseVector since all the prior center elements are
>> Gaussian and generally non-zero
>> - increased -a0 to 2
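
The point above about zero-valued prior centers and cosine distance can be illustrated with a minimal sketch. This is not Mahout's CosineDistanceMeasure or DistanceMeasureClusterDistribution; the class name, the helper methods, the tiny example document, and the seed are all hypothetical, and the distance is the plain textbook formula. It only shows that an all-zero center has no direction, so the textbook cosine distance to it is 0/0 (NaN in floating point), while a center filled with Gaussian values gives a usable distance.

```java
final class CosineZeroCenterSketch {
    // Textbook cosine distance: 1 - (a . b) / (|a| * |b|).
    // Assumption: this is NOT Mahout's implementation, only the plain formula.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Hypothetical prior center whose elements are standard Gaussian samples.
    static double[] gaussianCenter(int size, long seed) {
        java.util.Random rng = new java.util.Random(seed);
        double[] center = new double[size];
        for (int i = 0; i < size; i++) {
            center[i] = rng.nextGaussian();
        }
        return center;
    }

    public static void main(String[] args) {
        // Made-up TF-IDF-like document vector, for illustration only.
        double[] doc = {0.0, 0.7, 0.0, 0.3, 0.0, 0.65};
        double[] zeroCenter = new double[doc.length];
        double[] gaussCenter = gaussianCenter(doc.length, 42L);

        // 0/0 in the formula: the result is NaN for the all-zero center.
        System.out.println("zero center:     " + cosineDistance(doc, zeroCenter));
        // A non-zero center yields a finite distance between 0 and 2.
        System.out.println("gaussian center: " + cosineDistance(doc, gaussCenter));
    }
}
```

However a real implementation chooses to handle the zero-vector case, the result cannot depend on the direction of the document, so such a prior cannot tell documents apart. Standard Gaussian samples are negative roughly half the time, so a center generated this way naturally contains negative elements.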
>>
>> Build-reuters now does a much better job with the wide topic vectors using
>> the DMCD/CosineDM. And it runs maybe 100x faster too. Here are the new
>> arguments:
>>
>>  $MAHOUT dirichlet \
>>    -i ${WORK_DIR}/reuters-out-seqdir-sparse-dirichlet/tfidf-vectors \
>>    -o ${WORK_DIR}/reuters-dirichlet -k 20 -ow -x 10 -a0 2 \
>>    -md org.apache.mahout.clustering.dirichlet.models.DistanceMeasureClusterDistribution \
>>    -mp org.apache.mahout.math.DenseVector \
>>    -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>
>>
>> -----Original Message-----
>> From: Jeff Eastman [mailto:jeastman@Narus.com]
>> Sent: Wednesday, October 19, 2011 9:53 AM
>> To: user@mahout.apache.org
>> Subject: RE: Dirichlet Process Clustering not working
>>
>> The pdf() implementation in GaussianCluster is pretty lame. It
>> computes a running product of the element pdfs which, for wide input
>> vectors (Reuters is 41,807 elements wide), always underflows and returns 0.
>> Here's the code:
>>
>>  public double pdf(VectorWritable vw) {
>>    Vector x = vw.get();
>>    // return the product of the component pdfs
>>    // TODO: is this reasonable? correct? It seems to work in some cases.
>>    double pdf = 1;
>>    for (int i = 0; i < x.size(); i++) {
>>      // small prior on stdDev to avoid numeric instability when stdDev==0
>>      pdf *= UncommonDistributions.dNorm(x.getQuick(i),
>>          getCenter().getQuick(i), getRadius().getQuick(i) + 0.000001);
>>    }
>>    return pdf;
>>  }
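
The underflow described above can be reproduced outside Mahout, and one common way around it is to sum log-densities instead of multiplying densities. The sketch below is an assumption-laden illustration, not the fix that was committed: the class name and the local dNorm/logDNorm helpers are hypothetical stand-ins (UncommonDistributions.dNorm is not called), and a unit standard deviation is assumed for every element.

```java
final class LogPdfSketch {
    // Density of a univariate normal distribution at x (local stand-in helper).
    static double dNorm(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    // Log of the same density, computed directly so nothing is exponentiated.
    static double logDNorm(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return -0.5 * z * z - Math.log(sd) - 0.5 * Math.log(2.0 * Math.PI);
    }

    // Running product of element densities, mirroring the quoted pdf().
    static double productPdf(double[] x, double[] center, double[] radius) {
        double pdf = 1.0;
        for (int i = 0; i < x.length; i++) {
            pdf *= dNorm(x[i], center[i], radius[i] + 0.000001);
        }
        return pdf;
    }

    // Sum of element log-densities, i.e. the log of the product above.
    static double logPdf(double[] x, double[] center, double[] radius) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += logDNorm(x[i], center[i], radius[i] + 0.000001);
        }
        return sum;
    }

    public static void main(String[] args) {
        int width = 41807; // vector width reported for Reuters in this thread
        double[] x = new double[width];      // point sitting exactly on the center
        double[] center = new double[width];
        double[] radius = new double[width];
        java.util.Arrays.fill(radius, 1.0);  // assumed unit standard deviation

        // Each factor is about 0.399, so the product underflows to 0.0.
        System.out.println("product pdf = " + productPdf(x, center, radius));
        // The log-space sum stays a finite negative number.
        System.out.println("log pdf     = " + logPdf(x, center, radius));
    }
}
```

Even for a point lying exactly on the center, the product is 0.0 at this width, which matches the behaviour reported above. If log-densities were used for cluster assignment, they would still need to be normalised across clusters (for example by subtracting the maximum before exponentiating); that step is omitted here.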
>>
>> -----Original Message-----
>> From: Jeff Eastman [mailto:jeastman@Narus.com]
>> Sent: Wednesday, October 19, 2011 9:04 AM
>> To: user@mahout.apache.org
>> Subject: RE: Dirichlet Process Clustering not working
>>
>> I agree something is amiss here, but it could be the model is just not
>> suitable for this problem. Running with the Reuters dataset, I see all the
>> points being assigned to C-0 in the very first iteration as you do. I think
>> the problem is with the pdf() calculations in the mapper for very wide
>> vectors such as the ones we are using. For lower-dimensional vectors, DPC
>> appears to be working great.
>>
>> I'm going to commit the build-reuters.sh enhancements I've added for
>> FuzzyK and DPC so we can both use the same platform. I will report more
>> progress as I dig in deeper today...
>>
>> -----Original Message-----
>> From: edward choi [mailto:mp2893@gmail.com]
>> Sent: Wednesday, October 19, 2011 8:11 AM
>> To: user@mahout.apache.org
>> Subject: Re: Dirichlet Process Clustering not working
>>
>> Okay, I've just tried DPC with the Reuters document set.
>> I let 'build-reuters.sh' create the sequence files and vectors. (Judging
>> from the dictionary generated by Mahout, the number of features
>> seemed to be less than 100,000.)
>> Then I used them to do DPC (15 clusters, 10 iterations, 1.0 alpha,
>> clustering true, no additional options).
>> Below is the result of the clusterdump of clusters-10:
>>
>> ----------------------------------------------------------------------------------------------------------------------------
>> C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001, 0.01:0.004, 0.02:0.002,
>> 0.05:0.004, 0.07:0.005, 0.07
>>    Top Terms:
>>        said                                    =>  1.6577128281476725
>>        mln                                     =>  1.2455441154347937
>>        dlrs                                    =>  1.1173752482257673
>>        3                                       =>   1.042824193090437
>>        pct                                     =>  1.0223684722334667
>>        reuter                                  =>  0.9934255143959358
>> C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228, 0.01:-0.401, 0.02:-0.711,
>> 0.05:1.840, 0.07:0.136, 0.077:-0.739, 0.1:-0.177, 0.10:
>>    Top Terms:....
>> C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426, 0.01:-0.472, 0.02:0.672,
>> 0.05:0.800, 0.07:0.691, 0.077:1.037, 0.1:0
>>    Top Terms:....
>> C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748, 0.01:-1.030, 0.02:-1.760,
>> 0.05:-0.343, 0.07:0.286, 0.077:1.179,
>>    Top Terms:....
>>
>> ----------------------------------------------------------------------------------------------------------------------------
>> I guess the same thing happened again. So the document set is not the
>> problem. Something is definitely wrong with DPC.
>> The interesting thing is that the first cluster point does not have a single
>> negative value in it.
>> The rest of the cluster points have a lot of negative values. So I guess this
>> phenomenon has something to do with the first cluster hogging all the
>> documents.
>> Any comments on this result?
>> (I haven't tried TestClusterDumper.testDirichlet2&3 yet. I'll post another
>> thread when I am done with that.)
>>
>> Regards,
>> Ed
>>
>>
>>
>
