incubator-giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paolo Castagna <castagna.li...@googlemail.com>
Subject Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Date Thu, 17 May 2012 22:50:36 GMT
Sebastian Schelter wrote:
> You are right, the code would throw an exception here, but I guess it
> would be sufficient to simply add a check here and not send any messages.

That would avoid throwing an exception, but I am not sure it is actually the best thing to
do.
Some suggest to divide the PageRank scores of dangling nodes evenly among all other pages.
If you want, this is equivalent to another random jump, but forced by the fact that there
are no outgoing links to select from.
Using your notation below:

  0.15f / getNumVertices() + 0.85f ( sumOfNeighborRanks + sumOfDanlingNodesRanks / getNumVertices()
)

See also pag. 38 of the book ""Google's PageRank and Beyond":
http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38

In other works, in pseudo code a serial implementation which accounts for dangling nodes should
be:

--------
dangling_nodes_ranks = 0
foreach node in dangling_node
    dangling_nodes_ranks += ranks.get ( node )
dangling_nodes_ranks = ( dumping_factor * dangling_nodes_ranks ) / N ;

foeach node1 in graph
    r = 0
    foreach node2 in incoming_links ( node1 )
        r += ranks.get ( node2 ) / number_outgoing_links ( node2 )

r = dumping_factor * r + dangling_nodes_ranks + ( 1 - dumping_factor ) / N
--------

What do you think?

Paolo

> 
> --sebastian
> 
> 
> On 17.05.2012 23:44, Paolo Castagna wrote:
>> Hi Sebastian,
>> from SimplePageRankVertex.java, we have:
>>
>>     if (getSuperstep() < MAX_SUPERSTEPS) {
>>       long edges = getNumOutEdges();
>>       sendMsgToAllEdges(
>>           new DoubleWritable(getVertexValue().get() / edges));
>>     } else {
>>       voteToHalt();
>>     }
>>
>> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
>>
>> Paolo
>>
>> Sebastian Schelter wrote:
>>> The implementation implicitly deals with dangling nodes by computing the
>>> pageRank as
>>>
>>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>>
>>> Conceptually, this is the same as connecting every vertex with every
>>> other vertex it is not yet connected to. This removes all dangling nodes
>>> from the graph as every vertex can reach every other vertex.
>>>
>>> "Mining of Massive Datasets" [1] has a very well written explanation of
>>> this approach.
>>>
>>> --sebastian
>>>
>>>
>>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>>
>>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>>> Hi,
>>>> I noticed that the SimplePageRankVertex implementation does not deal with
>>>> dangling nodes (dangling nodes are those nodes with no outgoing links).
>>>> And, probably that is one of the good reasons why there is a "Simple" in
front. ;-)
>>>>
>>>>  "Dangling nodes exist in many forms. For example, a page of data, a page
with a
>>>> postscript graph, [...], a page that has been fetched by a crawler but not
yet
>>>> explored - these are all examples of possible dangling nodes. The more ambitious
>>>> the crawl, the bigger the proportion of dangling nodes because the set of
>>>> fetched but uncrawled pages grows quickly." (pag. 80) [1]
>>>>
>>>> Wikipedia's PageRank page says:
>>>>
>>>>  "If a page has no links to other pages, it becomes a sink and therefore
>>>> terminates the random surfing process. If the random surfer arrives at a
sink
>>>> page, it picks another URL at random and continues surfing again. When
>>>> calculating PageRank, pages with no outbound links are assumed to link out
to
>>>> all other pages in the collection. Their PageRank scores are therefore divided
>>>> evenly among all other pages. In other words, to be fair with pages that
are not
>>>> sinks, these random transitions are added to all nodes in the Web, with a
>>>> residual probability usually set to d = 0.85, estimated from the frequency
that
>>>> an average surfer uses his or her browser's bookmark feature." [2]
>>>>
>>>> Now, if I want to try to account for the dangling nodes I need to sum all
>>>> PageRank values for the dangling nodes and divide it evenly among all other
>>>> pages (which in practce means to send a message to all vertexes in Giraph...
>>>> which is not possible, as it would be crazy with large graphs).
>>>>
>>>> I imagine one can use a SumAggregator and alternate computing contribution
from
>>>> dangling nodes and PageRank values (i.e. PageRank values are computed only
every
>>>> 2 supersteps).
>>>>
>>>> What do you think?
>>>>
>>>> Paolo
>>>>
>>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
>>>> Science of Search Engine Rankings" (2006)
>>>>  [2] http://en.wikipedia.org/wiki/PageRank
> 

Mime
View raw message