mahout-dev mailing list archives

From Dan Filimon <dangeorge.fili...@gmail.com>
Subject Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]
Date Thu, 28 Mar 2013 09:01:30 GMT
[Yes, it should be on the dev list. I got confused.]

The thing is, it's happening when using just 1 mapper. The hypercube
tests indicate that the 3 versions of StreamingKMeans produce about
the same results.
I haven't tested them on the _unprojected_ vectors though.

Average distance in cluster 0 [18773]: 68.237385
Average distance in cluster 1 [2]: 5.973227
Average distance in cluster 2 [1]: 0.000000
Average distance in cluster 3 [4]: 279.200390
Average distance in cluster 4 [5]: 394.101672
Average distance in cluster 5 [4]: 227.845612
Average distance in cluster 6 [1]: 0.000000
Average distance in cluster 7 [2]: 28.779806
Average distance in cluster 8 [1]: 0.000000
Average distance in cluster 9 [2]: 215.254876
Average distance in cluster 10 [3]: 128.501163
Average distance in cluster 11 [8]: 534.401649
Average distance in cluster 12 [1]: 0.000000
Average distance in cluster 13 [5]: 405.115140
Average distance in cluster 14 [1]: 0.000000
Average distance in cluster 15 [9]: 215.797289
Average distance in cluster 16 [1]: 0.000000
Average distance in cluster 17 [2]: 123.065677
Average distance in cluster 18 [1]: 0.000000
Average distance in cluster 19 [2]: 98.733778
Num clusters: 20; maxDistance: 762.326896
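
A rough sketch for summarizing a report like the one above. It is not part of the Mahout code; the function names `parse_report` and `summarize` are made up for illustration, and the only assumption is that each line follows the "Average distance in cluster i [size]: distance" format shown.

```python
import re

# Matches lines such as "Average distance in cluster 0 [18773]: 68.237385".
LINE = re.compile(r"Average distance in cluster (\d+) \[(\d+)\]: ([\d.]+)")

def parse_report(text):
    """Return a list of (cluster_id, size, avg_distance) tuples."""
    return [(int(c), int(n), float(d)) for c, n, d in LINE.findall(text)]

def summarize(rows):
    """Return total points, the largest cluster's share, and the singleton count."""
    sizes = [n for _, n, _ in rows]
    total = sum(sizes)
    return {
        "total": total,
        "largest_share": max(sizes) / total,
        "singletons": sum(1 for n in sizes if n == 1),
    }

# Report lines copied from the message above.
REPORT = """\
Average distance in cluster 0 [18773]: 68.237385
Average distance in cluster 1 [2]: 5.973227
Average distance in cluster 2 [1]: 0.000000
Average distance in cluster 3 [4]: 279.200390
Average distance in cluster 4 [5]: 394.101672
Average distance in cluster 5 [4]: 227.845612
Average distance in cluster 6 [1]: 0.000000
Average distance in cluster 7 [2]: 28.779806
Average distance in cluster 8 [1]: 0.000000
Average distance in cluster 9 [2]: 215.254876
Average distance in cluster 10 [3]: 128.501163
Average distance in cluster 11 [8]: 534.401649
Average distance in cluster 12 [1]: 0.000000
Average distance in cluster 13 [5]: 405.115140
Average distance in cluster 14 [1]: 0.000000
Average distance in cluster 15 [9]: 215.797289
Average distance in cluster 16 [1]: 0.000000
Average distance in cluster 17 [2]: 123.065677
Average distance in cluster 18 [1]: 0.000000
Average distance in cluster 19 [2]: 98.733778
"""

rows = parse_report(REPORT)
summary = summarize(rows)
print(summary)
```

On these numbers, one cluster holds over 99% of the 18,828 points and seven of the 20 clusters are singletons, which is the clumping being discussed.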

On Thu, Mar 28, 2013 at 10:32 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> I will have to think on this a bit.
>
> It should be possible to dump the sketches coming from each mapper and look
> at them for compatibility.
>
> Are the mappers seeing only docs from a single news group?  That might
> produce some interesting and odd results.
>
> What happens with the sequential version when you specify as many threads
> as you have mappers in the MR version?
>
> Also, shouldn't this be on the dev list?
>
> On Thu, Mar 28, 2013 at 9:10 AM, Dan Filimon <dangeorge.filimon@gmail.com> wrote:
>
>> So no, apparently the problem's still there. With the most recent code, I
>> get:
>>
>> Average distance in cluster 0 [1]: 0.000000
>> Average distance in cluster 1 [18775]: 63.839819
>> Average distance in cluster 2 [11]: 448.706077
>> Average distance in cluster 3 [1]: 0.000000
>> Average distance in cluster 4 [8]: 213.629578
>> Average distance in cluster 5 [1]: 0.000000
>> Average distance in cluster 6 [10]: 369.592682
>> Average distance in cluster 7 [1]: 0.000000
>> Average distance in cluster 8 [2]: 31.061103
>> Average distance in cluster 9 [1]: 0.000000
>> Average distance in cluster 10 [2]: 309.934857
>> Average distance in cluster 11 [1]: 0.000000
>> Average distance in cluster 12 [1]: 0.000000
>> Average distance in cluster 13 [1]: 0.000000
>> Average distance in cluster 14 [1]: 0.000000
>> Average distance in cluster 15 [4]: 229.180504
>> Average distance in cluster 16 [1]: 0.000000
>> Average distance in cluster 17 [3]: 336.835246
>> Average distance in cluster 18 [2]: 76.485594
>> Average distance in cluster 19 [1]: 0.000000
>> Num clusters: 20; maxDistance: 724.060033
>>
>> I'll have to recheck. :/
>>
>> On Thu, Mar 28, 2013 at 2:25 AM, Ted Dunning <ted.dunning@gmail.com>
>> wrote:
>> > Hot damn!
>> >
>> > Well spotted.
>> >
>> > On Thu, Mar 28, 2013 at 12:08 AM, Dan Filimon
>> > <dangeorge.filimon@gmail.com> wrote:
>> >
>> >> Ted, remember we talked about this last week?
>> >>
>> >> The problem was (I think it's fixed now) that when I was asking for 20
>> >> clusters, every mapper would give me 20 clusters (rather than k log n
>> >> ~ 200), and the points clumped together, resulting in one cluster with
>> >> the vast majority of the points (~17K out of the ~19K).
>> >>
>> >> Now that I've fixed that, added more tests (which seem to confirm that
>> >> all StreamingKMeans implementations get about the same results, whether
>> >> they're local or MapReduce), and added the multiple restarts of
>> >> BallKMeans, I'm expecting it to be a lot better.
>> >>
>> >> Actual data tests coming soon (please check that new cluster thread). :)
>> >>
>>
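
For reference, the "k log n ~ 200" figure in the oldest quoted message can be checked with a few lines. This is only a back-of-the-envelope sketch: it assumes the natural logarithm and takes n as the 18,828 points that the cluster sizes in the first report add up to; `sketch_size` is a made-up helper name, not a Mahout function.

```python
import math

def sketch_size(k, n):
    """Approximate number of intermediate centroids as ceil(k * ln(n)).

    Assumption: 'log' in 'k log n' is the natural logarithm.
    """
    return math.ceil(k * math.log(n))

k = 20       # clusters requested
n = 18828    # total points (sum of the cluster sizes in the first report)
estimate = sketch_size(k, n)
print(estimate)
```

With k = 20 and n = 18828 this comes out just under 200, consistent with the rough figure quoted above and an order of magnitude more than the 20 centroids each mapper was emitting before the fix.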
