giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sebastian Schelter <>
Subject Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Date Thu, 17 May 2012 16:42:34 GMT
The implementation implicitly deals with dangling nodes by computing the
pageRank as

0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks

Conceptually, this is the same as connecting every vertex with every
other vertex it is not yet connected to. This removes all dangling nodes
from the graph as every vertex can reach every other vertex.

"Mining of Massive Datasets" [1] has a very well written explanation of
this approach.



On 17.05.2012 16:23, Paolo Castagna wrote:
> Hi,
> I noticed that the SimplePageRankVertex implementation does not deal with
> dangling nodes (dangling nodes are those nodes with no outgoing links).
> And, probably that is one of the good reasons why there is a "Simple" in front. ;-)
>  "Dangling nodes exist in many forms. For example, a page of data, a page with a
> postscript graph, [...], a page that has been fetched by a crawler but not yet
> explored - these are all examples of possible dangling nodes. The more ambitious
> the crawl, the bigger the proportion of dangling nodes because the set of
> fetched but uncrawled pages grows quickly." (pag. 80) [1]
> Wikipedia's PageRank page says:
>  "If a page has no links to other pages, it becomes a sink and therefore
> terminates the random surfing process. If the random surfer arrives at a sink
> page, it picks another URL at random and continues surfing again. When
> calculating PageRank, pages with no outbound links are assumed to link out to
> all other pages in the collection. Their PageRank scores are therefore divided
> evenly among all other pages. In other words, to be fair with pages that are not
> sinks, these random transitions are added to all nodes in the Web, with a
> residual probability usually set to d = 0.85, estimated from the frequency that
> an average surfer uses his or her browser's bookmark feature." [2]
> Now, if I want to try to account for the dangling nodes I need to sum all
> PageRank values for the dangling nodes and divide it evenly among all other
> pages (which in practce means to send a message to all vertexes in Giraph...
> which is not possible, as it would be crazy with large graphs).
> I imagine one can use a SumAggregator and alternate computing contribution from
> dangling nodes and PageRank values (i.e. PageRank values are computed only every
> 2 supersteps).
> What do you think?
> Paolo
>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
> Science of Search Engine Rankings" (2006)
>  [2]

View raw message