gearpump-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kam Kasravi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GEARPUMP-55) Add kmeans example
Date Wed, 20 Apr 2016 12:29:25 GMT

    [ https://issues.apache.org/jira/browse/GEARPUMP-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249768#comment-15249768
] 

Kam Kasravi commented on GEARPUMP-55:
-------------------------------------

Comment from [~mauzhang]

It will be great if we can have this example as a zeppelin notebook

> Add kmeans example
> ------------------
>
>                 Key: GEARPUMP-55
>                 URL: https://issues.apache.org/jira/browse/GEARPUMP-55
>             Project: Apache Gearpump
>          Issue Type: New Feature
>          Components: examples
>    Affects Versions: 0.8.0
>            Reporter: Kam Kasravi
>            Priority: Minor
>             Fix For: 0.8.1
>
>
> From [pangolulu|https://github.com/pangolulu]
> There is a document about streaming kmeans in Spark (https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html),
I think we can try to implement it on Gearpump. Here is my processor topology on Gearpump:
> !https://cloud.githubusercontent.com/assets/5796671/14097520/93a2b498-f5a4-11e5-8df8-ef2b62c3b5ff.PNG!
> The `Source Processor` will produce points by time, then broadcast the point to the `Distribution
Processor`. The number of tasks of the `Distribution Processor` is k, where each task save
one center and the corresponding points. When `Distribution Processor` receives a point from
`Source Processor`, it will calculate the distance of this point to its center, and then send
the distance along with the point and its `taskId` to the `Collection Processor`. When `Collection
Processor` receives the distance from `Distribution Processor`, it will accumulate the number
of current points, determine if it's time to update center, choose the smallest distance and
then send the point along with its corresponding `Distribution Processor` taskId by broadcast
partitioner. When `Distribution Processor` receives the result message, task with the corresponding
`taskId` will accumulate the point. If `Distribution Processor` receives that it's time to
update center, then all the tasks will update its corresponding center.
> This procedure is streaming and the center of cluster will change by time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message