incubator-giraph-user mailing list archives

From: Sebastian Schelter <...@apache.org>
Subject: Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Date: Thu, 17 May 2012 22:02:55 GMT
You are right, the code would throw an exception there, but I guess it
would be sufficient to simply add a check and not send any messages.
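
Something like the following minimal sketch, which just reuses the names from
the snippet quoted below and guards the division on getNumOutEdges():

    if (getSuperstep() < MAX_SUPERSTEPS) {
      long edges = getNumOutEdges();
      // Dangling node: no out-edges, so skip the division and send nothing.
      if (edges > 0) {
        sendMsgToAllEdges(
            new DoubleWritable(getVertexValue().get() / edges));
      }
    } else {
      voteToHalt();
    }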

--sebastian


On 17.05.2012 23:44, Paolo Castagna wrote:
> Hi Sebastian,
> from SimplePageRankVertex.java, we have:
> 
>     if (getSuperstep() < MAX_SUPERSTEPS) {
>       long edges = getNumOutEdges();
>       sendMsgToAllEdges(
>           new DoubleWritable(getVertexValue().get() / edges));
>     } else {
>       voteToHalt();
>     }
> 
> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
> 
> Paolo
> 
> Sebastian Schelter wrote:
>> The implementation implicitly deals with dangling nodes by computing the
>> pageRank as
>>
>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>
>> Conceptually, this is the same as connecting every vertex with every
>> other vertex it is not yet connected to. This removes all dangling nodes
>> from the graph as every vertex can reach every other vertex.
>>
>> "Mining of Massive Datasets" [1] has a very well written explanation of
>> this approach.
>>
>> --sebastian
>>
>>
>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>
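For reference, a rough sketch of that update inside compute(); msgIterator here
is an assumption standing in for the messages handed to compute(), and the rest
reuses the names from the snippets quoted above:

    double sumOfNeighborRanks = 0;
    while (msgIterator.hasNext()) {
      sumOfNeighborRanks += msgIterator.next().get();
    }
    // Teleport term plus damped neighbor contributions, as in the formula above.
    setVertexValue(new DoubleWritable(
        0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks));
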
>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>> Hi,
>>> I noticed that the SimplePageRankVertex implementation does not deal with
>>> dangling nodes (dangling nodes are those nodes with no outgoing links).
>>> And, probably that is one of the good reasons why there is a "Simple" in front. ;-)
>>>
>>>  "Dangling nodes exist in many forms. For example, a page of data, a page with
a
>>> postscript graph, [...], a page that has been fetched by a crawler but not yet
>>> explored - these are all examples of possible dangling nodes. The more ambitious
>>> the crawl, the bigger the proportion of dangling nodes because the set of
>>> fetched but uncrawled pages grows quickly." (p. 80) [1]
>>>
>>> Wikipedia's PageRank page says:
>>>
>>>  "If a page has no links to other pages, it becomes a sink and therefore
>>> terminates the random surfing process. If the random surfer arrives at a sink
>>> page, it picks another URL at random and continues surfing again. When
>>> calculating PageRank, pages with no outbound links are assumed to link out to
>>> all other pages in the collection. Their PageRank scores are therefore divided
>>> evenly among all other pages. In other words, to be fair with pages that are not
>>> sinks, these random transitions are added to all nodes in the Web, with a
>>> residual probability usually set to d = 0.85, estimated from the frequency that
>>> an average surfer uses his or her browser's bookmark feature." [2]
>>>
>>> Now, if I want to try to account for the dangling nodes I need to sum all
>>> PageRank values for the dangling nodes and divide it evenly among all other
>>> pages (which in practice means to send a message to all vertexes in Giraph...
>>> which is not possible, as it would be crazy with large graphs).
>>>
>>> I imagine one can use a SumAggregator and alternate between computing the
>>> contribution from dangling nodes and computing the PageRank values (i.e. the
>>> PageRank values are computed only every 2 supersteps).
>>>
>>> What do you think?
>>>
>>> Paolo
>>>
>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
>>> Science of Search Engine Rankings" (2006)
>>>  [2] http://en.wikipedia.org/wiki/PageRank
>>
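
For completeness, a rough sketch of the two-phase SumAggregator idea described
above. The aggregate() / getAggregatedValue() calls and the "dangling-mass" name
are placeholders for however the aggregator is actually registered and accessed,
and sumOfNeighborRanks again stands for the summed-up incoming messages:

    if (getSuperstep() % 2 == 0) {
      // Distribute phase: vertices with out-edges send their rank along them;
      // dangling vertices publish their rank to the SumAggregator instead.
      long edges = getNumOutEdges();
      if (edges > 0) {
        sendMsgToAllEdges(
            new DoubleWritable(getVertexValue().get() / edges));
      } else {
        aggregate("dangling-mass", getVertexValue());
      }
    } else {
      // Update phase: fold the received messages and the evenly divided
      // dangling mass (both produced in the previous superstep) into the rank.
      double danglingShare =
          getAggregatedValue("dangling-mass").get() / getNumVertices();
      setVertexValue(new DoubleWritable(
          0.15f / getNumVertices()
              + 0.85f * (sumOfNeighborRanks + danglingShare)));
    }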

