Could you just accumulate the PageRank mass of the dangling nodes in a
variable and divide it equally among all vertices in the next iteration?
If you use a preference vector instead of dividing it equally, you also
get the strongly preferential version of random walk with restart (RWR);
dividing equally gives the weakly preferential one.
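For instance, the redistribution step could be sketched like this in plain
Java, outside Giraph (all the names here are illustrative, nothing from the
Giraph API):

```java
import java.util.Arrays;

// Illustrative sketch: the dangling mass accumulated in one iteration
// is redistributed in the next, either uniformly (weakly preferential)
// or according to a preference vector (strongly preferential RWR).
public class DanglingRedistribution {

    // Spread danglingMass over the vertices according to 'preference',
    // a non-negative vector summing to 1.
    static double[] redistribute(double[] ranks, double danglingMass,
                                 double[] preference) {
        double[] next = ranks.clone();
        for (int v = 0; v < next.length; v++) {
            next[v] += danglingMass * preference[v];
        }
        return next;
    }

    // Uniform preference vector: the weakly preferential case.
    static double[] uniform(int n) {
        double[] p = new double[n];
        Arrays.fill(p, 1.0 / n);
        return p;
    }

    public static void main(String[] args) {
        double[] ranks = {0.25, 0.25, 0.25, 0.0}; // vertex 3's mass already collected
        double[] next = redistribute(ranks, 0.25, uniform(4));
        System.out.println(Arrays.toString(next));
    }
}
```

Passing a non-uniform vector to redistribute() gives the personalized
(strongly preferential) variant with no other code change.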
Cheers,

Gianmarco
On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna <
castagna.lists@googlemail.com> wrote:
> Sebastian Schelter wrote:
> > You are right, the code would throw an exception here, but I guess it
> > would be sufficient to simply add a check here and not send any messages.
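That check could look like this in a stand-alone mock (message sending is
simulated with a list; DanglingCheck and sendRankShare are stand-ins I made
up, not the real Giraph API):

```java
import java.util.ArrayList;
import java.util.List;

// Mock of the suggested fix: a vertex with no out-edges sends no
// messages instead of dividing by zero. 'sent' stands in for the
// messages that sendMsgToAllEdges() would deliver.
public class DanglingCheck {
    static final List<Double> sent = new ArrayList<>();

    static void sendRankShare(double vertexValue, long numOutEdges) {
        if (numOutEdges > 0) {                  // the suggested check
            double share = vertexValue / numOutEdges;
            for (long i = 0; i < numOutEdges; i++) {
                sent.add(share);                // simulates sendMsgToAllEdges()
            }
        }
        // dangling node (numOutEdges == 0): simply send nothing
    }

    public static void main(String[] args) {
        sendRankShare(0.5, 0);   // dangling node: no messages, no exception
        sendRankShare(0.6, 3);   // regular node: one share per out-edge
        System.out.println(sent);
    }
}
```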
>
> That would avoid throwing an exception, but I am not sure it is actually
> the best thing to do.
> Some suggest dividing the PageRank scores of dangling nodes evenly among
> all other pages.
> If you like, this is equivalent to another random jump, forced by the
> fact that there are no outgoing links to select from.
> Using your notation below:
>
> 0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks +
> sumOfDanglingNodesRanks / getNumVertices() )
>
> See also page 38 of the book "Google's PageRank and Beyond":
> http://books.google.co.uk/books?id=hxvB14I0twC&pg=PA38
>
> In other words, in pseudo-code, a serial implementation which accounts
> for dangling nodes could be:
>
> 
> dangling_nodes_rank = 0
> foreach node in dangling_nodes
>     dangling_nodes_rank += ranks.get ( node )
> dangling_nodes_rank = ( damping_factor * dangling_nodes_rank ) / N
>
> foreach node1 in graph
>     r = 0
>     foreach node2 in incoming_links ( node1 )
>         r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>     r = damping_factor * r + dangling_nodes_rank + ( 1 - damping_factor ) / N
>     new_ranks.set ( node1, r )
> 
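The pseudo-code above could be turned into plain Java along these lines (a
sketch only; DanglingPageRank, iterate() and the adjacency-list
representation are my choices, not anything from Giraph):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// One serial PageRank iteration that redistributes the rank mass of
// dangling nodes (vertices with no outgoing links) evenly over all N
// vertices, as in the pseudo-code above.
public class DanglingPageRank {
    static final double DAMPING = 0.85;

    static double[] iterate(List<List<Integer>> outLinks, double[] ranks) {
        int n = ranks.length;
        double[] next = new double[n];

        // Sum the rank of dangling nodes and spread it evenly,
        // scaled by the damping factor.
        double danglingMass = 0.0;
        for (int v = 0; v < n; v++) {
            if (outLinks.get(v).isEmpty()) {
                danglingMass += ranks[v];
            }
        }
        double danglingShare = DAMPING * danglingMass / n;

        // Push each vertex's rank to its out-neighbours.
        for (int v = 0; v < n; v++) {
            int degree = outLinks.get(v).size();
            for (int w : outLinks.get(v)) {
                next[w] += ranks[v] / degree;
            }
        }

        // r = damping_factor * r + dangling_share + (1 - damping_factor) / N
        for (int v = 0; v < n; v++) {
            next[v] = DAMPING * next[v] + danglingShare + (1 - DAMPING) / n;
        }
        return next;
    }

    public static void main(String[] args) {
        List<List<Integer>> g = new ArrayList<>();
        g.add(Arrays.asList(1));       // 0 -> 1
        g.add(Arrays.asList(0, 2));    // 1 -> 0, 1 -> 2
        g.add(new ArrayList<>());      // 2 is dangling
        double[] r = {1.0 / 3, 1.0 / 3, 1.0 / 3};
        for (int i = 0; i < 10; i++) {
            r = iterate(g, r);
        }
        System.out.println(Arrays.toString(r));
    }
}
```

Note that with the dangling mass folded back in, the ranks keep summing to
1 after every iteration, which is a handy sanity check.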
>
> What do you think?
>
> Paolo
>
> >
> > sebastian
> >
> >
> > On 17.05.2012 23:44, Paolo Castagna wrote:
> >> Hi Sebastian,
> >> from SimplePageRankVertex.java, we have:
> >>
> >> if (getSuperstep() < MAX_SUPERSTEPS) {
> >>     long edges = getNumOutEdges();
> >>     sendMsgToAllEdges(
> >>         new DoubleWritable(getVertexValue().get() / edges));
> >> } else {
> >>     voteToHalt();
> >> }
> >>
> >> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
> >>
> >> Paolo
> >>
> >> Sebastian Schelter wrote:
> >>> The implementation implicitly deals with dangling nodes by computing
> >>> the pageRank as
> >>>
> >>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
> >>>
> >>> Conceptually, this is the same as connecting every vertex with every
> >>> other vertex it is not yet connected to. This removes all dangling
> >>> nodes from the graph as every vertex can reach every other vertex.
> >>>
> >>> "Mining of Massive Datasets" [1] has a very well written explanation of
> >>> this approach.
> >>>
> >>> sebastian
> >>>
> >>>
> >>> [1] http://infolab.stanford.edu/~ullman/mmds.html
> >>>
> >>> On 17.05.2012 16:23, Paolo Castagna wrote:
> >>>> Hi,
> >>>> I noticed that the SimplePageRankVertex implementation does not deal
> >>>> with dangling nodes (dangling nodes are those nodes with no outgoing
> >>>> links). And probably that is one of the reasons why there is a
> >>>> "Simple" in front. ;)
> >>>>
> >>>> "Dangling nodes exist in many forms. For example, a page of data, a
> >>>> page with a postscript graph, [...], a page that has been fetched by a
> >>>> crawler but not yet explored - these are all examples of possible
> >>>> dangling nodes. The more ambitious the crawl, the bigger the proportion
> >>>> of dangling nodes because the set of fetched but uncrawled pages grows
> >>>> quickly." (page 80) [1]
> >>>>
> >>>> Wikipedia's PageRank page says:
> >>>>
> >>>> "If a page has no links to other pages, it becomes a sink and
> >>>> therefore terminates the random surfing process. If the random surfer
> >>>> arrives at a sink page, it picks another URL at random and continues
> >>>> surfing again. When calculating PageRank, pages with no outbound links
> >>>> are assumed to link out to all other pages in the collection. Their
> >>>> PageRank scores are therefore divided evenly among all other pages. In
> >>>> other words, to be fair with pages that are not sinks, these random
> >>>> transitions are added to all nodes in the Web, with a residual
> >>>> probability usually set to d = 0.85, estimated from the frequency that
> >>>> an average surfer uses his or her browser's bookmark feature." [2]
> >>>>
> >>>> Now, if I want to account for the dangling nodes, I need to sum all
> >>>> the PageRank values of the dangling nodes and divide the total evenly
> >>>> among all other pages (which in practice means sending a message to
> >>>> all vertexes in Giraph... which is not feasible, as it would be crazy
> >>>> with large graphs).
> >>>>
> >>>> I imagine one can use a SumAggregator and alternate between computing
> >>>> the contribution from dangling nodes and the PageRank values (i.e.
> >>>> PageRank values are computed only every 2 supersteps).
> >>>>
> >>>> What do you think?
> >>>>
> >>>> Paolo
> >>>>
> >>>> [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and
> >>>> Beyond: The Science of Search Engine Rankings" (2006)
> >>>> [2] http://en.wikipedia.org/wiki/PageRank
> >
>
