flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From vasia <...@git.apache.org>
Subject [GitHub] flink pull request: [FLINK-2310] Add an Adamic Adar Similarity exa...
Date Wed, 08 Jul 2015 22:36:28 GMT
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/892#issuecomment-119750756
  
    Hi @shghatge,
    thank you for working on this! It's not a trivial problem :)
    
    I have two general comments:
    
    - I would prefer to see this as library method instead of an example. As an example, it
shows how to use neighborhood methods, joinWithVertices and getTriplets, but we already have
the JaccardSimilarity for all of these.
    
    - The algorithm is actually very hard to scale. It essentially requires common neighborhood
calculation. This is a real pain for large graphs with skewed vertices. @andralungu can explain
to you more, since she's been working on optimizing similar problems for her work.
    An additional problem that this particular implementation has is that it also stores and
sends vertex values, not only vertex IDs. If you try to run this with a big graph, it will
probably crush or run forever.
    
    There are 2 things we could do:
    
    1). Only send a list of vertex IDs and let the receiving vertex generate partial sums
for each common neighbor. At the end, we can aggregate the sums per edge. For example, consider
the edges (1, 2), (1, 3), (2, 3), (1, 4) and (2, 4). In the neighborhood method, vertex 3
will find that it has a common neighbors with 1, vertex 2. Then, it will emit a tuple `<1,
2, 1/log(d3)>`. In the same way, vertex 4 will find that it has a common neighbor with
1, vertex 2 and it will emit a tuple `<1, 2, 1/log(d4)>`.
    The output of the neighborhood method will be a dataset of tuples that represent edges
with partial sums. You can then compute the Adamic-Adar similarity by summing up the values
per vertex.
    
    2). The above solution will be OK for regular graphs, but will still have problems in
the presence of skew. Another approach would be to offer an approximate Adamic-Adar similarity,
by representing neighborhoods as bloom filters. In this case, the size of common neighborhoods
will be over-estimated, but the computation will be faster.
    
    @shghatge, @andralungu, let me know what you think and how you would like to proceed!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Mime
View raw message