spark-issues mailing list archives

From "Philip Adetiloye (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-18363) Connected component for large graph result is wrong
Date Wed, 09 Nov 2016 20:53:59 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652009#comment-15652009 ]

Philip Adetiloye commented on SPARK-18363:
------------------------------------------

I logged a similar issue against GraphFrames, but the problem also exists in GraphX.

Basically, I'm trying to cluster a hierarchical dataset. This works fine for a small dataset:
I can cluster the data into separate clusters.

However, for a large hierarchical dataset (about `1.60 million vertices`), the result seems wrong.

The resulting clusters from connected components have many intersections. This should not be
the case, since every vertex belongs to exactly one connected component. I expect the
hierarchical dataset to be clustered into separate, smaller clusters.
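
To illustrate the expectation above, here is a minimal union-find sketch in plain Scala. It is not the GraphX or GraphFrames implementation; the object name `ComponentsSketch` and its method are hypothetical, and it assumes every edge endpoint also appears in the vertex list. It only shows that a correct connected-components labelling gives each vertex exactly one label, so the resulting clusters cannot intersect.

```scala
// Illustrative only: a tiny in-memory union-find, not what GraphX runs internally.
object ComponentsSketch {
  // Returns a map from each vertex id to the label of its component.
  // In this sketch the label is the smallest vertex id in the component.
  def connectedComponents(vertices: Seq[Long], edges: Seq[(Long, Long)]): Map[Long, Long] = {
    val parent = scala.collection.mutable.Map[Long, Long]()
    vertices.foreach(v => parent(v) = v)

    // Follow parent pointers to the root, then compress the path.
    def find(v: Long): Long = {
      var root = v
      while (parent(root) != root) root = parent(root)
      var cur = v
      while (parent(cur) != root) {
        val next = parent(cur)
        parent(cur) = root
        cur = next
      }
      root
    }

    // Merge the components of both endpoints, keeping the smaller root as label.
    edges.foreach { case (a, b) =>
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) {
        if (ra < rb) parent(rb) = ra else parent(ra) = rb
      }
    }

    vertices.map(v => v -> find(v)).toMap
  }
}
```

Because the result is a `Map` keyed by vertex id, each id has exactly one component by construction. Any output in which one id shows up under two components therefore points to a problem in the input or in the computation.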

    ...
    val vertices = universe.map(u => (u.id, u.username, u.age, u.gamescore))
                              .toDF("id", "username", "age","gamescore")
                              .alias("v")


    val lookup = sparkSession.sparkContext.broadcast(universeMap.rdd.collectAsMap())


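    // Map both usernames to their numeric vertex ids; Option.get throws
    // NoSuchElementException if either username is missing from the lookup.
    def buildEdges(src: String, dest: String) = {
        Edge(lookup.value.get(src).get, lookup.value.get(dest).get, 0)
    }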


    val edges  =  similarityDatasetNoJboss.mapPartitions(_.map(s => buildEdges(s.username1, s.username2)))
                                          .toDF("src", "dst", "default")

    val graph = GraphFrame(vertices, edges)

    val cc = graph.connectedComponents.run().select("id", "component")


Then run a validation test:

    SELECT id, COUNT(component)
    FROM cc
    GROUP BY id

I expect each `id` to belong to exactly one cluster/component, so the count should be 1.
Instead, an `id` belongs to multiple clusters/components.
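
The validation described above can be sketched on a small in-memory collection of `(id, component)` pairs. This is an illustration in plain Scala, not code from the report; the object name `ComponentCheck` and its method are hypothetical. On the actual result, the same idea would be a group by `id` with a count of distinct `component` values.

```scala
// Illustrative only: checks (id, component) pairs held in memory.
object ComponentCheck {
  // Returns the ids assigned to more than one distinct component,
  // together with the set of components each of them was assigned to.
  def idsWithMultipleComponents(rows: Seq[(Long, Long)]): Map[Long, Set[Long]] =
    rows
      .groupBy { case (id, _) => id }
      .map { case (id, group) => id -> group.map(_._2).toSet }
      .filter { case (_, components) => components.size > 1 }
}
```

Counting distinct components rather than rows separates two situations: an `id` that merely appears twice with the same component, and an `id` that is really assigned to two different components. An empty result means every `id` maps to a single component.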

> Connected component for large graph result is wrong
> ---------------------------------------------------
>
>                 Key: SPARK-18363
>                 URL: https://issues.apache.org/jira/browse/SPARK-18363
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 2.0.1
>            Reporter: Philip Adetiloye
>
> The clustering done by GraphX connected components doesn't seem to work correctly with
> large nodes.
> It only works correctly on a small graph.




