Hi Sebastian,
from SimplePageRankVertex.java, we have:
  if (getSuperstep() < MAX_SUPERSTEPS) {
    long edges = getNumOutEdges();
    sendMsgToAllEdges(
        new DoubleWritable(getVertexValue().get() / edges));
  } else {
    voteToHalt();
  }
What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
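
To make the question concrete, here is a minimal standalone sketch (plain Java, not Giraph; the graph, names and the redistribution step are made up for illustration) of one PageRank sweep over a tiny graph with a dangling vertex. Without any special handling the dangling vertex sends nothing, so its rank mass is simply dropped and the ranks no longer sum to 1; summing the dangling mass (what a SumAggregator would accumulate) and splitting it evenly restores the total:

```java
import java.util.Arrays;

public class DanglingDemo {
    static final double D = 0.85; // damping factor, as in SimplePageRankVertex

    // One synchronous PageRank sweep. If redistribute is true, the rank
    // mass of dangling vertices is summed and split evenly over all
    // vertices (the aggregator idea); otherwise that mass is lost, which
    // is what happens when a vertex with zero out-edges just sends nothing.
    static double[] sweep(int[][] out, double[] rank, boolean redistribute) {
        int n = rank.length;
        double[] next = new double[n];
        double danglingMass = 0.0;
        for (int v = 0; v < n; v++) {
            if (out[v].length == 0) {
                danglingMass += rank[v]; // no out-edges: mass goes nowhere
            } else {
                double share = rank[v] / out[v].length;
                for (int t : out[v]) next[t] += share;
            }
        }
        for (int v = 0; v < n; v++) {
            double in = next[v] + (redistribute ? danglingMass / n : 0.0);
            next[v] = (1 - D) / n + D * in;
        }
        return next;
    }

    public static void main(String[] args) {
        // 0 -> 1, 1 -> 2, vertex 2 is dangling (no out-edges)
        int[][] out = { {1}, {2}, {} };
        double[] rank = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };

        double lossySum = Arrays.stream(sweep(out, rank, false)).sum();
        double fairSum  = Arrays.stream(sweep(out, rank, true)).sum();
        System.out.printf("without redistribution: sum = %.4f%n", lossySum);
        System.out.printf("with redistribution:    sum = %.4f%n", fairSum);
    }
}
```

(Note also that with `long edges = 0`, the division in the snippet above produces Infinity or NaN for a double numerator rather than throwing, so the bug is silent.)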
Paolo
Sebastian Schelter wrote:
> The implementation implicitly deals with dangling nodes by computing the
> pageRank as
>
> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>
> Conceptually, this is the same as connecting every vertex with every
> other vertex it is not yet connected to. This removes all dangling nodes
> from the graph as every vertex can reach every other vertex.
>
> "Mining of Massive Datasets" [1] has a very well written explanation of
> this approach.
>
> sebastian
>
>
> [1] http://infolab.stanford.edu/~ullman/mmds.html
>
> On 17.05.2012 16:23, Paolo Castagna wrote:
>> Hi,
>> I noticed that the SimplePageRankVertex implementation does not deal with
>> dangling nodes (dangling nodes are those nodes with no outgoing links).
>> And that is probably one of the good reasons why there is a "Simple" in front. ;)
>>
>> "Dangling nodes exist in many forms. For example, a page of data, a page with a
>> postscript graph, [...], a page that has been fetched by a crawler but not yet
>> explored -- these are all examples of possible dangling nodes. The more ambitious
>> the crawl, the bigger the proportion of dangling nodes because the set of
>> fetched but uncrawled pages grows quickly." (p. 80) [1]
>>
>> Wikipedia's PageRank page says:
>>
>> "If a page has no links to other pages, it becomes a sink and therefore
>> terminates the random surfing process. If the random surfer arrives at a sink
>> page, it picks another URL at random and continues surfing again. When
>> calculating PageRank, pages with no outbound links are assumed to link out to
>> all other pages in the collection. Their PageRank scores are therefore divided
>> evenly among all other pages. In other words, to be fair with pages that are not
>> sinks, these random transitions are added to all nodes in the Web, with a
>> residual probability usually set to d = 0.85, estimated from the frequency that
>> an average surfer uses his or her browser's bookmark feature." [2]
>>
>> Now, if I want to account for the dangling nodes, I need to sum the
>> PageRank values of all dangling nodes and divide that sum evenly among all
>> other pages (which in practice means sending a message to every vertex in
>> Giraph... which is not feasible, as it would be crazy with large graphs).
>>
>> I imagine one can use a SumAggregator and alternate between computing the
>> contribution from dangling nodes and computing the PageRank values (i.e.
>> PageRank values are updated only every 2 supersteps).
>>
>> What do you think?
>>
>> Paolo
>>
>> [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
>> Science of Search Engine Rankings" (2006)
>> [2] http://en.wikipedia.org/wiki/PageRank
>
