spark-issues mailing list archives

From "Philip Adetiloye (JIRA)" <>
Subject [jira] [Commented] (SPARK-18363) Connected component for large graph result is wrong
Date Wed, 09 Nov 2016 20:53:59 GMT


Philip Adetiloye commented on SPARK-18363:

I logged a similar issue against GraphFrames, but the problem also exists in GraphX.

Basically, I'm trying to cluster a hierarchical dataset. This works fine for a small dataset:
I can cluster the data into separate clusters.

However, for a large hierarchical dataset (about `1.60 million vertices`) the result seems wrong.

The clusters produced by connected components overlap in many places. This should not be
the case. I expect the hierarchical dataset to be split into separate, smaller, non-overlapping clusters.
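To make the expectation concrete, here is a minimal reference sketch in plain Scala (no Spark) that computes connected components with union-find. It is not how GraphX or GraphFrames implement the algorithm; the object name `ReferenceComponents` and its method are hypothetical helpers, meant only for cross-checking a small sample of the graph. Each vertex ends up with exactly one label, so the components cannot overlap.

```scala
import scala.collection.mutable

// Hypothetical helper for cross-checking results on a small sample graph.
object ReferenceComponents {

  // Returns a map from each vertex id to the smallest vertex id in its component.
  def connectedComponents(vertices: Seq[Long], edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Include edge endpoints so an edge never refers to an unknown vertex.
    val all = (vertices ++ edges.flatMap { case (a, b) => Seq(a, b) }).distinct

    // Every vertex starts as the root of its own component.
    val parent = mutable.Map[Long, Long]()
    all.foreach(v => parent(v) = v)

    // Find the root of v, then compress the path from v to that root.
    def find(v: Long): Long = {
      var root = v
      while (parent(root) != root) root = parent(root)
      var cur = v
      while (parent(cur) != root) {
        val next = parent(cur)
        parent(cur) = root
        cur = next
      }
      root
    }

    // Merge the two components of every edge, keeping the smaller root id.
    for ((a, b) <- edges) {
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) {
        if (ra < rb) parent(rb) = ra else parent(ra) = rb
      }
    }

    all.map(v => v -> find(v)).toMap
  }
}
```

For example, `ReferenceComponents.connectedComponents(Seq(1L, 2L, 3L, 4L, 5L), Seq((1L, 2L), (2L, 3L), (4L, 5L)))` labels vertices 1, 2 and 3 with component 1 and vertices 4 and 5 with component 4. Because the result is a `Map`, an id can never carry two labels.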

    val vertices = => (, u.username, u.age, u.gamescore))
                              .toDF("id", "username", "age","gamescore")

    val lookup = sparkSession.sparkContext.broadcast(universeMap.rdd.collectAsMap())

    def buildEdges(src: String, dest: String) = {
        // Look up the numeric vertex id for each username; fails if a username is missing
        Edge(lookup.value.get(src).get, lookup.value.get(dest).get, 0)
    }

    val edges  =  similarityDatasetNoJboss.mapPartitions( => buildEdges(s.username1,
                                          .toDF("src", "dst", "default")

    val graph = GraphFrame(vertices, edges)

    val cc ="id", "component")

Then run a validation query:

    Select id, count(component)
    group by id

I expect each `id` to belong to exactly one cluster/component, so the count should be 1. Instead,
some `id` values belong to multiple clusters/components.
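The validation described above could be sketched with the DataFrame API as follows. This assumes the connected-components output is a DataFrame with columns `id` and `component`, as in the snippet earlier in this comment; the object name `ComponentCheck`, the column name `numComponents`, and the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, countDistinct}

// Hypothetical helper: lists every id that is assigned to more than one component.
object ComponentCheck {

  def idsInMultipleComponents(cc: DataFrame): DataFrame =
    cc.groupBy("id")
      .agg(countDistinct("component").as("numComponents"))
      .filter(col("numComponents") > 1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("component-check")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample: id 3 appears in two different components.
    val cc = Seq((1L, 10L), (2L, 10L), (3L, 30L), (3L, 31L)).toDF("id", "component")

    // An empty result means every id belongs to exactly one component.
    idsInMultipleComponents(cc).show()

    spark.stop()
  }
}
```

Using `countDistinct` rather than a plain `count` separates two situations: an id that really sits in several components, and an id that merely appears in several rows with the same component, which a plain count would also flag.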

> Connected component for large graph result is wrong
> ---------------------------------------------------
>                 Key: SPARK-18363
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 2.0.1
>            Reporter: Philip Adetiloye
> The clustering done by GraphX connected components doesn't seem to work correctly on graphs with a large number of nodes.
> It only works correctly on a small graph

This message was sent by Atlassian JIRA
