flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-3780) Jaccard Similarity
Date Fri, 20 May 2016 14:26:12 GMT

    [ https://issues.apache.org/jira/browse/FLINK-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293438#comment-15293438 ]

ASF GitHub Bot commented on FLINK-3780:
---------------------------------------

Github user vasia commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r64049343
  
    --- Diff: docs/apis/batch/libs/gelly.md ---
    @@ -2055,22 +2055,22 @@ vertex and edge in the output graph stores the common group value and the number
     ### Jaccard Index
     
     #### Overview
    -The Jaccard Index measures the similarity between vertex neighborhoods. Scores range from 0.0 (no common neighbors) to
    -1.0 (all neighbors are common).
    +The Jaccard Index measures the similarity between vertex neighborhoods and is computed as the number of shared numbers
    +divided by the number of distinct neighbors. Scores range from 0.0 (no shared neighbors) to 1.0 (all neighbors are
    +shared).
     
     #### Details
    -Counting common neighbors for pairs of vertices is equivalent to counting the two-paths consisting of two edges
    -connecting the two vertices to the common neighbor. The number of distinct neighbors for pairs of vertices is computed
    -by storing the sum of degrees of the vertex pair and subtracting the count of common neighbors, which are double-counted
    -in the sum of degrees.
    +Counting shared neighbors for pairs of vertices is equivalent to counting connecting paths of length two. The number of
    +distinct neighbors is computed by storing the sum of degrees of the vertex pair and subtracting the count of shared
    +neighbors, which are double-counted in the sum of degrees.
     
    -The algorithm first annotates each edge with the endpoint degree. Grouping on the midpoint vertex, each pair of
    -neighbors is emitted with the endpoint degree sum. Grouping on two-paths, the common neighbors are counted.
    +The algorithm first annotates each edge with the target vertex's degree. Grouping on the source vertex, each pair of
    +neighbors is emitted with the degree sum. Grouping on two-paths, the shared neighbors are counted.
     
     #### Usage
     The algorithm takes a simple, undirected graph as input and outputs a `DataSet` of tuples containing two vertex IDs,
    -the number of common neighbors, and the number of distinct neighbors. The graph ID type must be `Comparable` and
    -`Copyable`.
    +the number of shared neighbors, and the number of distinct neighbors. The result class provides a method to compute the
    +Jaccard Index score. The graph ID type must be `Comparable` and `Copyable`.
    --- End diff --
    
    Here we should also document the output of the algorithm, i.e. the `Result` type, and how to get the Jaccard similarity out of it.
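
For readers following the diff, the two-path counting it describes can be illustrated with a minimal, single-machine sketch in plain Java. This is not the Gelly implementation and does not use the `Result` type from the pull request: the class `JaccardSketch` and its methods `neighborCounts` and `score` are hypothetical names chosen for this example, and integer vertex IDs are assumed for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; names are not taken from Flink or Gelly.
public class JaccardSketch {

    /**
     * For every vertex pair "u,v" (u < v) with at least one shared neighbor,
     * returns {sharedNeighborCount, distinctNeighborCount}.
     * Assumes a simple, undirected graph: each edge is listed once, no self-loops.
     */
    public static Map<String, int[]> neighborCounts(int[][] edges) {
        // Build adjacency lists; the list size is the vertex degree.
        Map<Integer, List<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }

        Map<String, int[]> result = new TreeMap<>();
        // Group on the midpoint vertex and emit each pair of its neighbors
        // together with the sum of the pair's degrees.
        for (List<Integer> neighbors : adj.values()) {
            Collections.sort(neighbors);
            for (int i = 0; i < neighbors.size(); i++) {
                for (int j = i + 1; j < neighbors.size(); j++) {
                    int u = neighbors.get(i);
                    int v = neighbors.get(j);
                    int degreeSum = adj.get(u).size() + adj.get(v).size();
                    // counts[0]: two-paths seen so far (shared neighbors)
                    // counts[1]: degree sum, corrected after the loop
                    int[] counts = result.computeIfAbsent(
                            u + "," + v, k -> new int[] {0, degreeSum});
                    counts[0]++;
                }
            }
        }

        // Shared neighbors are double-counted in the degree sum, so
        // distinct = degreeSum - shared.
        for (int[] counts : result.values()) {
            counts[1] -= counts[0];
        }
        return result;
    }

    /** Jaccard score: shared neighbors divided by distinct neighbors. */
    public static double score(int[] counts) {
        return (double) counts[0] / counts[1];
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {0, 2}, {1, 2}, {1, 3}};
        for (Map.Entry<String, int[]> e : neighborCounts(edges).entrySet()) {
            int[] c = e.getValue();
            System.out.println(e.getKey() + " shared=" + c[0]
                    + " distinct=" + c[1] + " score=" + score(c));
        }
    }
}
```

In this example graph, vertices 0 and 3 share the single neighbor 1 and have two distinct neighbors (1 and 2) between them, so their score is 0.5. Pairs with no shared neighbor never appear in the output, which matches the issue's goal of computing only non-zero scores.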


> Jaccard Similarity
> ------------------
>
>                 Key: FLINK-3780
>                 URL: https://issues.apache.org/jira/browse/FLINK-3780
>             Project: Flink
>          Issue Type: New Feature
>          Components: Gelly
>    Affects Versions: 1.1.0
>            Reporter: Greg Hogan
>            Assignee: Greg Hogan
>             Fix For: 1.1.0
>
>
> Implement a Jaccard Similarity algorithm computing all non-zero similarity scores. This algorithm is similar to {{TriangleListing}} but instead of joining two-paths against an edge list we count two-paths.
> {{flink-gelly-examples}} currently has {{JaccardSimilarityMeasure}} which relies on {{Graph.getTriplets()}} so only computes similarity scores for neighbors but not neighbors-of-neighbors.
> This algorithm is easily modified for other similarity scores such as Adamic-Adar similarity where the sum of endpoint degrees is replaced by the degree of the middle vertex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
