mahout-dev mailing list archives

From "Rohini Uppuluri (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
Date Wed, 03 Mar 2010 10:48:27 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840574#action_12840574 ]

Rohini Uppuluri commented on MAHOUT-153:
----------------------------------------


Hi,

Please find a brief description of the input and output formats below. Hope this helps.

------------------------------------------------------------------------------

Input Format:
documentId\tdocument vector


Example line:

338	     [s1682, 275:5.0, 478:3.0, 479:5.0, 1:3.0, 474:4.0, 143:2.0, 197:5.0, 196:2.0, 286:4.0, 135:5.0, 86:4.0, 216:4.0, 83:2.0, 213:5.0, 215:3.0, 208:3.0, 269:4.0, 517:5.0, 169:5.0, 654:5.0, 443:5.0, 990:4.0, 175:4.0, 513:5.0, 514:5.0, 650:5.0, 525:4.0, 1124:4.0, 382:5.0, 708:5.0, 497:3.0, 498:4.0, 523:3.0, 427:4.0, 488:5.0, 490:5.0, 189:4.0, 52:5.0, 301:4.0, 607:4.0, 180:4.0, ]
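
For readers who want to consume this format outside Mahout, here is a minimal parsing sketch in Python. It is not Mahout code. It assumes that the leading `s1682` token marks a sparse vector of cardinality 1682, that the remaining tokens are `index:value` pairs, and that the trailing `, ]` can be ignored. The function names are hypothetical.

```python
def parse_sparse_vector(text):
    """Parse '[s1682, 275:5.0, 478:3.0, ]' into (cardinality, {index: value})."""
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("vector must be enclosed in [ ]")
    # Split on commas and drop empty tokens (the trailing ', ]' produces one).
    tokens = [t.strip() for t in body[1:-1].split(",") if t.strip()]
    if not tokens or not tokens[0].startswith("s"):
        raise ValueError("expected a leading 's<cardinality>' token")
    # Assumption: the number after 's' is the vector cardinality.
    cardinality = int(tokens[0][1:])
    entries = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        entries[int(index)] = float(value)
    return cardinality, entries


def parse_input_line(line):
    """Split 'documentId<TAB>vector' and parse the vector part."""
    doc_id, vector_text = line.split("\t", 1)
    cardinality, entries = parse_sparse_vector(vector_text)
    return doc_id.strip(), cardinality, entries
```

Calling `parse_input_line` on the example line above would return the document id `"338"`, the cardinality `1682`, and a dictionary mapping each index to its value.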


Output Format:
ClusterIdentifier\tClusterIdentifier: clusterCenterVector

Example line:
C0	C0: [s1682, 275:3.0, 1:4.0, 273:5.0, 272:2.0, 3:1.0, 546:4.0, 277:3.0, 276:3.0, 7:5.0, 283:3.0, 282:4.0, 9:1.0, 281:4.0, 12:5.0, 1089:2.0, 13:1.0, 286:1.0, 14:1.0, 15:2.0, 284:3.0, 258:4.0, 17:3.0, 257:5.0, 23:5.0, 25:2.0, 264:4.0, 270:5.0, 271:3.0, 31:5.0, 305:1.0, 1405:3.0, 307:4.0, 39:5.0, 311:3.0, 310:3.0, 515:5.0, 313:5.0, 525:5.0, 315:3.0, 316:4.0, 288:3.0, 50:5.0, 532:4.0, 291:5.0, 292:4.0, 55:5.0, 293:4.0, 294:3.0, 295:5.0, 298:4.0, 56:5.0, 300:5.0, 539:2.0, 302:4.0, 343:4.0, 882:4.0, 340:1.0, 887:4.0, 1025:4.0, 619:3.0, 79:5.0, 347:2.0, 346:4.0, 345:1.0, 344:1.0, 326:4.0, 327:3.0, 1051:4.0, 322:4.0, 323:4.0, 628:2.0, 333:4.0, 331:4.0, 1047:4.0, 328:4.0, 636:4.0, 100:5.0, 98:5.0, 581:4.0, 370:3.0, 591:3.0, 118:5.0, 595:4.0, 117:4.0, 358:2.0, 597:4.0, 127:5.0, 1073:4.0, 603:5.0, 121:5.0, 683:4.0, 413:3.0, 678:4.0, 950:4.0, 405:4.0, 156:5.0, 696:4.0, 1244:4.0, 147:5.0, 690:3.0, 928:3.0, 151:1.0, 924:3.0, 443:4.0, 654:5.0, 925:2.0, 649:4.0, 164:5.0, 642:4.0, 185:5.0, 431:5.0, 905:4.0, 1278:4.0, 176:4.0, 183:5.0, 657:5.0, 898:1.0, 181:4.0, 659:4.0, 1016:4.0, 477:1.0, 751:4.0, 475:4.0, 750:4.0, 203:5.0, 472:2.0, 748:3.0, 471:5.0, 1011:2.0, 466:5.0, 742:5.0, 1013:3.0, 1014:4.0, 762:4.0, 222:5.0, 760:3.0, 460:4.0, 458:3.0, 218:4.0, 237:4.0, 235:3.0, 504:5.0, 717:4.0, 234:4.0, 991:1.0, 233:5.0, 978:2.0, 229:5.0, 226:5.0, 254:1.0, 255:4.0, 252:3.0, 250:5.0, 248:4.0, 245:4.0, ]
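
An output line can be read back in a similar way. The sketch below is Python, not Mahout code. It assumes that the cluster identifier appears twice (once as the key before the tab, once as a label before the colon), that both copies match, and that the vector uses the same `s<cardinality>, index:value` layout as the input. The function name is hypothetical.

```python
import re


def parse_cluster_line(line):
    """Parse 'C0<TAB>C0: [s1682, 275:3.0, ]' into (cluster_id, cardinality, center)."""
    key, rest = line.split("\t", 1)
    label, vector_text = rest.split(":", 1)
    key, label = key.strip(), label.strip()
    # Assumption: the key and the label inside the value are the same identifier.
    if key != label:
        raise ValueError("cluster key %r does not match label %r" % (key, label))
    header = re.search(r"\[s(\d+)", vector_text)
    if header is None:
        raise ValueError("expected a vector starting with '[s<cardinality>'")
    # Collect every 'index:value' pair; the 's1682' header has no colon and is skipped.
    pairs = re.findall(r"(\d+):(-?\d+(?:\.\d+)?)", vector_text)
    center = {int(index): float(value) for index, value in pairs}
    return key, int(header.group(1)), center
```

Applied to the example line above, this would yield the identifier `"C0"`, the cardinality `1682`, and the cluster center as an index-to-value dictionary.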





> Implement kmeans++ for initial cluster selection in kmeans
> ----------------------------------------------------------
>
>                 Key: MAHOUT-153
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-153
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.2
>         Environment: OS Independent
>            Reporter: Panagiotis Papadimitriou
>            Assignee: Ted Dunning
>             Fix For: 0.4
>
>         Attachments: Mahout-153.patch, MAHOUT-153_RandomFarthest.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The current implementation of k-means includes the following algorithms for initial cluster selection (seed selection): 1) random selection of k points, 2) use of canopy clusters.
> I plan to implement k-means++. The details of the algorithm are available here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
> Design Outline: I will create an abstract class SeedGenerator and a subclass KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will become a subclass of SeedGenerator.
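
As a rough illustration of the k-means++ seeding step described in the issue, here is a minimal sketch. It is written in Python for brevity and does not reflect the attached patches or the proposed SeedGenerator classes. It assumes points are sparse index-to-value dictionaries. The first seed is drawn uniformly at random, and each later seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far. All function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse {index: value} vectors."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centers from points using k-means++ style sampling."""
    rng = rng or random.Random()
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # First seed: uniform random choice.
    seeds = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest seed.
    d2 = [squared_distance(p, seeds[0]) for p in points]
    while len(seeds) < k:
        if sum(d2) == 0.0:
            # Every point coincides with a seed; fall back to a uniform choice.
            chosen = rng.choice(points)
        else:
            # Sample proportionally to squared distance from the nearest seed.
            chosen = rng.choices(points, weights=d2, k=1)[0]
        seeds.append(chosen)
        d2 = [min(d, squared_distance(p, chosen)) for d, p in zip(d2, points)]
    return seeds
```

Because points that coincide with an existing seed get weight zero, they cannot be picked again while any other point remains at a nonzero distance. This is what spreads the initial centers apart compared with plain random selection.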

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

