giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paolo Castagna <castagna.li...@googlemail.com>
Subject Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Date Thu, 17 May 2012 21:44:27 GMT
Hi Sebastian,
from SimplePageRankVertex.java, we have:

    if (getSuperstep() < MAX_SUPERSTEPS) {
      long edges = getNumOutEdges();
      sendMsgToAllEdges(
          new DoubleWritable(getVertexValue().get() / edges));
    } else {
      voteToHalt();
    }

What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?

Paolo

Sebastian Schelter wrote:
> The implementation implicitly deals with dangling nodes by computing the
> pageRank as
> 
> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
> 
> Conceptually, this is the same as connecting every vertex with every
> other vertex it is not yet connected to. This removes all dangling nodes
> from the graph as every vertex can reach every other vertex.
> 
> "Mining of Massive Datasets" [1] has a very well written explanation of
> this approach.
> 
> --sebastian
> 
> 
> [1] http://infolab.stanford.edu/~ullman/mmds.html
> 
> On 17.05.2012 16:23, Paolo Castagna wrote:
>> Hi,
>> I noticed that the SimplePageRankVertex implementation does not deal with
>> dangling nodes (dangling nodes are those nodes with no outgoing links).
>> And, probably that is one of the good reasons why there is a "Simple" in front. ;-)
>>
>>  "Dangling nodes exist in many forms. For example, a page of data, a page with a
>> postscript graph, [...], a page that has been fetched by a crawler but not yet
>> explored - these are all examples of possible dangling nodes. The more ambitious
>> the crawl, the bigger the proportion of dangling nodes because the set of
>> fetched but uncrawled pages grows quickly." (pag. 80) [1]
>>
>> Wikipedia's PageRank page says:
>>
>>  "If a page has no links to other pages, it becomes a sink and therefore
>> terminates the random surfing process. If the random surfer arrives at a sink
>> page, it picks another URL at random and continues surfing again. When
>> calculating PageRank, pages with no outbound links are assumed to link out to
>> all other pages in the collection. Their PageRank scores are therefore divided
>> evenly among all other pages. In other words, to be fair with pages that are not
>> sinks, these random transitions are added to all nodes in the Web, with a
>> residual probability usually set to d = 0.85, estimated from the frequency that
>> an average surfer uses his or her browser's bookmark feature." [2]
>>
>> Now, if I want to try to account for the dangling nodes I need to sum all
>> PageRank values for the dangling nodes and divide it evenly among all other
>> pages (which in practce means to send a message to all vertexes in Giraph...
>> which is not possible, as it would be crazy with large graphs).
>>
>> I imagine one can use a SumAggregator and alternate computing contribution from
>> dangling nodes and PageRank values (i.e. PageRank values are computed only every
>> 2 supersteps).
>>
>> What do you think?
>>
>> Paolo
>>
>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
>> Science of Search Engine Rankings" (2006)
>>  [2] http://en.wikipedia.org/wiki/PageRank
> 

Mime
View raw message