flink-issues mailing list archives

From "Vasia Kalavri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
Date Wed, 11 May 2016 14:45:13 GMT

    [ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280246#comment-15280246 ]

Vasia Kalavri commented on FLINK-3879:
--------------------------------------

[~greghogan]
- Do we agree that the PR for FLINK-2044 is now in a good state and could be merged? Or would
you rather benchmark this implementation against it and go with the more performant one?
- Gelly library methods: currently there are scatter-gather and GSA implementations for PageRank,
Connected Components, and SSSP. We have both because GSA performs better for graphs with
skewed degree distributions. In the Gelly docs ([iteration abstractions comparison|https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html#iteration-abstractions-comparison]),
we describe when GSA should be preferred over scatter-gather. Maybe we can make this more
explicit.
There is no Pregel implementation among the library methods (only in the examples). The {{GSATriangleCount}} library method
has proved to be very inefficient and should be removed imo (I'll open a JIRA).
- I'm not sure what you mean by "approximate HITS"?
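
To make the GSA point above a bit more concrete, here is a minimal sketch of the gather-sum-apply pattern applied to SSSP. It is plain Python, not Gelly's API: the function name {{gsa_sssp}} and the edge-list format are assumptions made for illustration. The "sum" step is a pairwise, associative reduction ({{min}}), and as far as I understand that is what lets partial results for a high-degree vertex be combined early.

```python
import math


def gsa_sssp(edges, source, max_iterations=10):
    """Single-source shortest paths in gather-sum-apply style.

    edges: list of (src, dst, weight) tuples (hypothetical input format).
    Returns a dict mapping each vertex to its distance from `source`.
    """
    vertices = {v for s, d, _ in edges for v in (s, d)} | {source}
    dist = {v: math.inf for v in vertices}
    dist[source] = 0.0

    for _ in range(max_iterations):
        summed = {}
        for src, dst, weight in edges:
            # Gather: each edge yields one candidate distance for its target.
            candidate = dist[src] + weight
            # Sum: candidates for the same target are reduced pairwise with min.
            summed[dst] = min(summed.get(dst, math.inf), candidate)

        # Apply: a vertex keeps the reduced value only if it is an improvement.
        changed = False
        for vertex, candidate in summed.items():
            if candidate < dist[vertex]:
                dist[vertex] = candidate
                changed = True
        if not changed:
            break  # no vertex changed, so further supersteps are pointless

    return dist
```

For example, {{gsa_sssp([(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0)], source=1)}} settles vertex 3 at distance 3.0 via vertex 2 rather than 5.0 via the direct edge.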

> Native implementation of HITS algorithm
> ---------------------------------------
>
>                 Key: FLINK-3879
>                 URL: https://issues.apache.org/jira/browse/FLINK-3879
>             Project: Flink
>          Issue Type: New Feature
>          Components: Gelly
>    Affects Versions: 1.1.0
>            Reporter: Greg Hogan
>            Assignee: Greg Hogan
>             Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on the link information among a set of documents. The algorithm presumes that a good hub is a document that points to many others, and a good authority is a document that many documents point to." [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, outputting both hub and authority scores, and completing in half the number of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm
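
For readers unfamiliar with HITS, the hub/authority recurrence quoted above can be sketched in a few lines. This is a plain-Python illustration of the textbook algorithm, not the code proposed in FLINK-3879 or FLINK-2044. The names {{hits}} and {{_normalize}}, the edge-list format, the L2 normalization, and the convergence test on the summed absolute change are all assumptions made for the example.

```python
import math


def _normalize(scores):
    """Scale scores to unit L2 norm (left unchanged if all scores are zero)."""
    norm = math.sqrt(sum(s * s for s in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {v: s / norm for v, s in scores.items()}


def hits(edges, max_iterations=100, tolerance=1e-9):
    """Return (hub, authority) score dicts for a directed graph.

    edges: list of (src, dst) pairs (hypothetical input format).
    """
    vertices = sorted({v for edge in edges for v in edge})
    hub = {v: 1.0 for v in vertices}
    auth = {v: 1.0 for v in vertices}

    for _ in range(max_iterations):
        # Authority score: sum of the hub scores of vertices pointing here.
        new_auth = {v: 0.0 for v in vertices}
        for src, dst in edges:
            new_auth[dst] += hub[src]

        # Hub score: sum of the fresh authority scores of vertices pointed to.
        new_hub = {v: 0.0 for v in vertices}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]

        new_auth = _normalize(new_auth)
        new_hub = _normalize(new_hub)

        # Stop once both score vectors have stopped moving.
        delta = sum(
            abs(new_hub[v] - hub[v]) + abs(new_auth[v] - auth[v])
            for v in vertices
        )
        hub, auth = new_hub, new_auth
        if delta < tolerance:
            break

    return hub, auth
```

Both score vectors are updated in the same pass of the loop and both are returned, with a tolerance-based stop in addition to the iteration cap.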



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
