I'd like to implement the test described in this paper [1] and also
explained in this presentation [2].
I went over the paper and I think I understand it well enough.
The main gist is that when dealing with high-dimensional data that has
lots of uncorrelated features (which should totally not be the case for
us!), distances become meaningless, because the ratio of the maximum
distance to the minimum distance gets arbitrarily close to 1.
This isn't really about this particular data set, but I find it
challenging to figure out whether distances are meaningful at all, so any
help is welcome.
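
To make the idea concrete, here is a minimal sketch of the kind of check I
have in mind. It is not the test from [1]; it only measures the relative
contrast (d_max - d_min) / d_min on synthetic i.i.d. uniform data with
Euclidean distance. The names relative_contrast and mean_contrast are made
up for illustration and don't come from the paper or from Mahout.

```python
import math
import random


def relative_contrast(points, query):
    """Return (d_max - d_min) / d_min for Euclidean distances from query."""
    dists = [math.dist(query, p) for p in points]
    d_min, d_max = min(dists), max(dists)
    if d_min == 0.0:
        raise ValueError("query coincides with a data point")
    return (d_max - d_min) / d_min


def mean_contrast(dim, num_points=200, trials=20, seed=0):
    """Average relative contrast over random queries.

    Assumption: features are i.i.d. uniform on [0, 1], i.e. the
    uncorrelated case where distances are expected to concentrate.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        points = [[rng.random() for _ in range(dim)]
                  for _ in range(num_points)]
        query = [rng.random() for _ in range(dim)]
        total += relative_contrast(points, query)
    return total / trials


if __name__ == "__main__":
    # The contrast should shrink towards 0 as the dimension grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(mean_contrast(dim), 3))
```

On our real vectors the same quantity could be computed from a sample of
documents; if it stays well away from 0, distances still carry information.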
What do you think, Ted?
[1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
[2] http://www.cs.bham.ac.uk/~axk/Dagstuhl.pdf
On Thu, Mar 28, 2013 at 10:29 PM, Dan Filimon
<dangeorge.filimon@gmail.com> wrote:
> And I'll add that revectorizing the documents with my vectorizer yields
> essentially the same results (this is CosineDistance though):
>
> Average distance in cluster 0 [6]: 0.844053
> Average distance in cluster 1 [1047]: 0.988517
> Average distance in cluster 2 [26]: 0.889580
> Average distance in cluster 3 [19]: 0.922804
> Average distance in cluster 4 [2]: 0.414935
> Average distance in cluster 5 [9]: 0.777650
> Average distance in cluster 6 [4]: 0.791443
> Average distance in cluster 7 [17432]: 1.017289
> Average distance in cluster 8 [20]: 0.917523
> Average distance in cluster 9 [4]: 0.744159
> Average distance in cluster 10 [2]: 0.340740
> Average distance in cluster 11 [3]: 0.614734
> Average distance in cluster 12 [2]: 0.624274
> Average distance in cluster 13 [62]: 0.922437
> Average distance in cluster 14 [2]: 0.324862
> Average distance in cluster 15 [1]: 0.000000
> Average distance in cluster 16 [94]: 0.917509
> Average distance in cluster 17 [103]: 0.944392
> Average distance in cluster 18 [7]: 0.795449
> Average distance in cluster 19 [1]: 0.000000
> Num clusters: 20; maxDistance: 1.029701
>
>
> On Thu, Mar 28, 2013 at 6:45 PM, Dan Filimon <dangeorge.filimon@gmail.com> wrote:
>
>> You know what's even more odd? When I used Mahout's KMeans, everything
>> was assigned to one single cluster with mean distance 64.
>>
>>
>> On Thu, Mar 28, 2013 at 11:07 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>
>>> Hmm... looking at these outputs, it looks like the big cluster is really
>>> tight ... much tighter than cluster 3 or 4. That is very odd.
>>>
>>> On Thu, Mar 28, 2013 at 10:01 AM, Dan Filimon
>>> <dangeorge.filimon@gmail.com> wrote:
>>>
>>> > [Yes, it should be on the dev list. I got confused.]
>>> >
>>> > The thing is, it's happening when using just 1 mapper. The hypercube
>>> > tests indicate that the 3 versions of StreamingKMeans produce about
>>> > the same results.
>>> > I haven't tested them on the _unprojected_ vectors though.
>>> >
>>> > Average distance in cluster 0 [18773]: 68.237385
>>> > Average distance in cluster 1 [2]: 5.973227
>>> > Average distance in cluster 2 [1]: 0.000000
>>> > Average distance in cluster 3 [4]: 279.200390
>>> > Average distance in cluster 4 [5]: 394.101672
>>> > Average distance in cluster 5 [4]: 227.845612
>>> > Average distance in cluster 6 [1]: 0.000000
>>> > Average distance in cluster 7 [2]: 28.779806
>>> > Average distance in cluster 8 [1]: 0.000000
>>> > Average distance in cluster 9 [2]: 215.254876
>>> > Average distance in cluster 10 [3]: 128.501163
>>> > Average distance in cluster 11 [8]: 534.401649
>>> > Average distance in cluster 12 [1]: 0.000000
>>> > Average distance in cluster 13 [5]: 405.115140
>>> > Average distance in cluster 14 [1]: 0.000000
>>> > Average distance in cluster 15 [9]: 215.797289
>>> > Average distance in cluster 16 [1]: 0.000000
>>> > Average distance in cluster 17 [2]: 123.065677
>>> > Average distance in cluster 18 [1]: 0.000000
>>> > Average distance in cluster 19 [2]: 98.733778
>>> > Num clusters: 20; maxDistance: 762.326896
>>> >
>>> > On Thu, Mar 28, 2013 at 10:32 AM, Ted Dunning <ted.dunning@gmail.com>
>>> > wrote:
>>> > > I will have to think on this a bit.
>>> > >
>>> > > It should be possible to dump the sketches coming from each mapper
>>> > > and look at them for compatibility.
>>> > >
>>> > > Are the mappers seeing only docs from a single news group? That
>>> > > might produce some interesting and odd results.
>>> > >
>>> > > What happens with the sequential version when you specify as many
>>> > > threads as you have mappers in the MR version?
>>> > >
>>> > > Also, shouldn't this be on the dev list?
>>> > >
>>> > > On Thu, Mar 28, 2013 at 9:10 AM, Dan Filimon
>>> > > <dangeorge.filimon@gmail.com> wrote:
>>> > >
>>> > >> So no, apparently the problem's still there. With the most recent
>>> > >> code, I get:
>>> > >>
>>> > >> Average distance in cluster 0 [1]: 0.000000
>>> > >> Average distance in cluster 1 [18775]: 63.839819
>>> > >> Average distance in cluster 2 [11]: 448.706077
>>> > >> Average distance in cluster 3 [1]: 0.000000
>>> > >> Average distance in cluster 4 [8]: 213.629578
>>> > >> Average distance in cluster 5 [1]: 0.000000
>>> > >> Average distance in cluster 6 [10]: 369.592682
>>> > >> Average distance in cluster 7 [1]: 0.000000
>>> > >> Average distance in cluster 8 [2]: 31.061103
>>> > >> Average distance in cluster 9 [1]: 0.000000
>>> > >> Average distance in cluster 10 [2]: 309.934857
>>> > >> Average distance in cluster 11 [1]: 0.000000
>>> > >> Average distance in cluster 12 [1]: 0.000000
>>> > >> Average distance in cluster 13 [1]: 0.000000
>>> > >> Average distance in cluster 14 [1]: 0.000000
>>> > >> Average distance in cluster 15 [4]: 229.180504
>>> > >> Average distance in cluster 16 [1]: 0.000000
>>> > >> Average distance in cluster 17 [3]: 336.835246
>>> > >> Average distance in cluster 18 [2]: 76.485594
>>> > >> Average distance in cluster 19 [1]: 0.000000
>>> > >> Num clusters: 20; maxDistance: 724.060033
>>> > >>
>>> > >> I'll have to recheck. :/
>>> > >>
>>> > >> On Thu, Mar 28, 2013 at 2:25 AM, Ted Dunning <ted.dunning@gmail.com>
>>> > >> wrote:
>>> > >> > Hot damn!
>>> > >> >
>>> > >> > Well spotted.
>>> > >> >
>>> > >> > On Thu, Mar 28, 2013 at 12:08 AM, Dan Filimon
>>> > >> > <dangeorge.filimon@gmail.com> wrote:
>>> > >> >
>>> > >> >> Ted, remember we talked about this last week?
>>> > >> >>
>>> > >> >> The problem was (I think it's fixed now) that when I was asking
>>> > >> >> for 20 clusters, every mapper would give me 20 clusters (rather
>>> > >> >> than k log n ~ 200) and the points clumped together, resulting in
>>> > >> >> one cluster with the vast majority of the points, ~17K out of the
>>> > >> >> ~19K.
>>> > >> >>
>>> > >> >> Now that I've fixed that and added more tests, which seem to
>>> > >> >> confirm that all StreamingKMeans implementations get about the
>>> > >> >> same results (whether they're local or MapReduce), plus the
>>> > >> >> multiple restarts of BallKMeans, I'm expecting it to be a lot
>>> > >> >> better.
>>> > >> >>
>>> > >> >> Actual data tests coming soon (please check that new cluster
>>> > >> >> thread). :)
>>> > >> >>
>>> > >>
>>> >
>>>
>>
>>
>
