mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: MAHOUT-236 Cluster Evaluation Tools?
Date Fri, 09 Apr 2010 03:44:31 GMT
The cluster center itself is a representative point. One pass over the data
will get us the points that are close enough to it. Or, exhaustively, we can
just add it to the KMeans mapper and update a counter, maybe?
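
A minimal sketch of the single-pass idea, in plain Python rather than as a Mahout mapper. Everything here is an assumption for illustration: Euclidean distance, in-memory lists, the hypothetical function name `representative_points`, and the choice of keeping the `k` points nearest to each center as the "close enough" points. The per-cluster count stands in for the counter.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def representative_points(points, centers, k=2):
    """One pass over the data: assign each point to its nearest center,
    keep the k points closest to that center, and count points per cluster.

    Hypothetical helper, not a Mahout API.
    """
    # One bounded heap per cluster. Distances are negated so the heap root
    # is always the farthest point currently kept.
    heaps = [[] for _ in centers]
    counts = [0] * len(centers)
    for seq, p in enumerate(points):
        dists = [euclidean(p, c) for c in centers]
        idx = min(range(len(centers)), key=dists.__getitem__)
        counts[idx] += 1  # stands in for the job counter
        item = (-dists[idx], seq, tuple(p))
        if len(heaps[idx]) < k:
            heapq.heappush(heaps[idx], item)
        elif item > heaps[idx][0]:
            # New point is closer than the farthest kept point: swap it in.
            heapq.heapreplace(heaps[idx], item)
    # Order each cluster's representatives nearest-first.
    reps = [[p for _, _, p in sorted(h, reverse=True)] for h in heaps]
    return reps, counts


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (0, 3), (10, 10), (10, 11), (8, 10), (10, 14)]
    reps, counts = representative_points(pts, [(0, 0), (10, 10)], k=2)
    print(reps, counts)
```

Memory stays bounded at `k` points per cluster, which is what would make the same logic plausible inside a mapper.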

Robin

On Fri, Apr 9, 2010 at 4:13 AM, Jeff Eastman <jdog@windwardsolutions.com> wrote:

> Looking at the paper it doesn't seem to require MR for the final CDbw
> calculation, right? For each cluster we only need to compare one of its
> points with one point in each other cluster. With small numbers of
> representative points per cluster that can be done easily in memory. I'd
> love to see the code you have for computing representative points.
>
> Jeff
>
>
>
> Robin Anil wrote:
>
>> On Wed, Apr 7, 2010 at 11:50 PM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
>>
>>> Hi Robin,
>>>
>>> Interesting paper. I'm beginning to see how to MR the representative point
>>> selection already. The rest will hopefully become clearer with more study.
>>> Lots of MR jobs are needed to:
>>>
>>> a) get the data into Vectors,
>>
>> We have something for text, missing for other formats
>>
>>> b) iterate (e.g. kmeans) over the data to produce a set of clusters,
>>
>> Done
>>
>>> c) cluster the data,
>>
>> Done
>>
>>> d) iterate over the clustered data to derive representative points for
>>> each cluster, and finally
>>
>> Done ;)
>>
>>> e) produce the CDbw.
>>
>> TODO
>>
>>> And, of course all of this is again iterated with different values for the
>>> clustering algorithm's parameters. Should keep the lights on at PG&E
>>> producing power for the server farms.
>>>
>>> Robin Anil wrote:
>>>
>>>
>>>
>>>> Hi Jeff,
>>>> This is a good paper with a simple measure of cluster quality based on
>>>> intra-cluster density and inter-cluster separation. It's pretty easy to
>>>> compute. We need to make it a map/reduce job.
>>>>
>>>>
>>>> http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
>>>> Robin
>
>
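
Jeff's point above, that comparing a handful of representative points per cluster fits easily in memory, can be sketched as follows. This is only the pairwise comparison step, not the full CDbw index from the paper; Euclidean distance and the hypothetical function name `closest_representatives` are assumptions for illustration.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_representatives(reps):
    """For every pair of clusters, find the closest pair of representative
    points and the distance between them.

    `reps` is a list with one list of representative points per cluster.
    Hypothetical helper; it covers only the in-memory comparison step.
    """
    result = {}
    for (i, reps_i), (j, reps_j) in combinations(enumerate(reps), 2):
        # Brute force is fine: only a few representatives per cluster.
        d, p, q = min(
            ((euclidean(p, q), p, q) for p in reps_i for q in reps_j),
            key=lambda t: t[0],
        )
        result[(i, j)] = (p, q, d)
    return result


if __name__ == "__main__":
    reps = [[(0, 0), (0, 1)], [(3, 1), (5, 5)]]
    for pair, (p, q, d) in closest_representatives(reps).items():
        print(pair, p, q, round(d, 3))
```

With r representatives per cluster and c clusters, this does on the order of c² · r² distance computations, which is small as long as r stays small.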
