giraph-user mailing list archives

From Paolo Castagna <castagna.li...@googlemail.com>
Subject Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Date Fri, 18 May 2012 10:31:18 GMT
Hi Gianmarco,

Gianmarco De Francisci Morales wrote:
> Could you just accumulate the PR of dangling nodes in a variable and
> divide it equally in the next iteration?

Yep, this is exactly what I was thinking when I wrote: "I imagine one can use a SumAggregator
and alternate computing contribution from dangling nodes and PageRank values (i.e. PageRank
values are computed only every 2 supersteps)."

You need to use an Aggregator and spend a superstep just to aggregate/sum the PageRank of
all dangling nodes. Then, in the following superstep, you can take that sum divided by the
number of nodes in the graph and distribute it equally among all other nodes (i.e. the
sumOfDanglingNodesRanks / getNumVertices() term below).
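
To make that concrete, here is a minimal serial sketch in plain Java (this is NOT Giraph code:
the class, the method and the danglingMass variable are just illustrative names, and the 0.85
constant mirrors SimplePageRankVertex). The first loop computes the quantity the SumAggregator
would hold after the "aggregation" superstep; the rest is what each vertex would effectively
compute in the following superstep:

--------
import java.util.List;

/**
 * Serial illustration only (not Giraph code): one PageRank iteration that
 * redistributes the rank held by dangling nodes evenly among all nodes.
 */
public class DanglingPageRankSketch {

  private static final double D = 0.85d; // damping factor, as in SimplePageRankVertex

  /** graph.get(i) = out-neighbours of node i; returns the next rank vector. */
  static double[] iterate(List<List<Integer>> graph, double[] ranks) {
    int n = ranks.length;

    // "Superstep 1": sum the PageRank currently sitting on dangling nodes
    // (this is what the SumAggregator would accumulate).
    double danglingMass = 0.0d;
    for (int node = 0; node < n; node++) {
      if (graph.get(node).isEmpty()) {
        danglingMass += ranks[node];
      }
    }

    // "Superstep 2": usual neighbour contributions ...
    double[] next = new double[n];
    for (int src = 0; src < n; src++) {
      List<Integer> out = graph.get(src);
      if (out.isEmpty()) {
        continue; // dangling nodes send nothing; their mass is in danglingMass
      }
      double share = ranks[src] / out.size(); // what sendMsgToAllEdges() would send
      for (int dst : out) {
        next[dst] += share;
      }
    }
    // ... plus teleportation and the evenly redistributed dangling mass,
    // i.e. 0.15 / N + 0.85 * (sumOfNeighborRanks + danglingMass / N).
    for (int node = 0; node < n; node++) {
      next[node] = (1 - D) / n + D * (next[node] + danglingMass / n);
    }
    return next;
  }
}
--------

In the Giraph version the two loops become two supersteps, with each vertex reading the
aggregated value back in the second one.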

Another change/enhancement could be to check for convergence (to do that, you need to keep
both the current and the previous PageRank values) and spend another superstep just doing
that; see the sketch after this paragraph. But, as you can see, things stop being simple, and
the point of SimplePageRankVertex is to offer a very simple example.
I did not start this thread with the aim of changing SimplePageRankVertex; I think it should
stay as it is (and as Google's Pregel paper describes it).
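
For completeness, the check itself is cheap once each vertex remembers its previous value.
In serial form (again just an illustration, e.g. as another method of the sketch above; the
tolerance value is an arbitrary example) it boils down to:

--------
  /**
   * Convergence check: L1 distance between two successive rank vectors,
   * compared against a tolerance. In Giraph, the sum over |current - previous|
   * is what an aggregator would collect in the extra superstep, and the result
   * would decide whether the vertices voteToHalt().
   */
  static boolean hasConverged(double[] previous, double[] current, double tolerance) {
    double delta = 0.0d;
    for (int i = 0; i < previous.length; i++) {
      delta += Math.abs(current[i] - previous[i]);
    }
    return delta < tolerance; // e.g. tolerance = 1e-6 (illustrative)
  }
--------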

However, in practice, if you want to run PageRank or something similar, you need to deal with
problems such as: what to do with dangling nodes/pages? How do you know how many iterations
you need to reach the precision you want/need? Etc.

My initial message was to exchange some ideas and get some feedback and/or suggestions on
how a not too simple PageRankVertex might look. ;-)

> If you use a preference vector instead of dividing it equally, then you
> have also the strongly preferential (and weakly preferential in case you
> divide equally) versions of RWR.

Yep, but how do you get your preference vector? ;-)
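
Just to spell out the difference (my notation, with p(v) a preference vector whose entries
sum to 1), the strongly preferential update would, if I have the terminology right, look like:

 0.15f * p(v) + 0.85f * ( sumOfNeighborRanks + sumOfDanglingNodesRanks * p(v) )

whereas the weakly preferential variant keeps sumOfDanglingNodesRanks / getNumVertices() for
the dangling part.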

As a first step, it would be great if we could come up with a PageRankVertex which takes
dangling nodes into account and perhaps/optionally checks for convergence every few supersteps.

Anyway, thanks for your comments and feedback Gianmarco.

Paolo

> 
> Cheers,
> --
> Gianmarco
> 
> 
> 
> 
> On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna
> <castagna.lists@googlemail.com> wrote:
> 
>     Sebastian Schelter wrote:
>     > You are right, the code would throw an exception here, but I guess it
>     > would be sufficient to simply add a check here and not send any
>     messages.
> 
>     That would avoid throwing an exception, but I am not sure it is
>     actually the best thing to do.
>     Some suggest to divide the PageRank scores of dangling nodes evenly
>     among all other pages.
>     If you want, this is equivalent to another random jump, but forced
>     by the fact that there are no outgoing links to select from.
>     Using your notation below:
> 
>      0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks + sumOfDanglingNodesRanks / getNumVertices() )
> 
>     See also page 38 of the book "Google's PageRank and Beyond":
>     http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
> 
>     In other words, in pseudo-code a serial implementation which
>     accounts for dangling nodes would be:
> 
>     --------
>     dangling_nodes_ranks = 0
>     foreach node in dangling_nodes
>        dangling_nodes_ranks += ranks.get ( node )
>     dangling_nodes_ranks = ( damping_factor * dangling_nodes_ranks ) / N
>
>     foreach node1 in graph
>        r = 0
>        foreach node2 in incoming_links ( node1 )
>            r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>        r = damping_factor * r + dangling_nodes_ranks + ( 1 - damping_factor ) / N
>        new_ranks.put ( node1, r )
>     --------
> 
>     What do you think?
> 
>     Paolo
> 
>     >
>     > --sebastian
>     >
>     >
>     > On 17.05.2012 23:44, Paolo Castagna wrote:
>     >> Hi Sebastian,
>     >> from SimplePageRankVertex.java, we have:
>     >>
>     >>     if (getSuperstep() < MAX_SUPERSTEPS) {
>     >>       long edges = getNumOutEdges();
>     >>       sendMsgToAllEdges(
>     >>           new DoubleWritable(getVertexValue().get() / edges));
>     >>     } else {
>     >>       voteToHalt();
>     >>     }
>     >>
>     >> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling
>     node)?
>     >>
>     >> Paolo
>     >>
>     >> Sebastian Schelter wrote:
>     >>> The implementation implicitly deals with dangling nodes by
>     computing the
>     >>> pageRank as
>     >>>
>     >>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>     >>>
>     >>> Conceptually, this is the same as connecting every vertex with every
>     >>> other vertex it is not yet connected to. This removes all
>     dangling nodes
>     >>> from the graph as every vertex can reach every other vertex.
>     >>>
>     >>> "Mining of Massive Datasets" [1] has a very well written
>     explanation of
>     >>> this approach.
>     >>>
>     >>> --sebastian
>     >>>
>     >>>
>     >>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>     >>>
>     >>> On 17.05.2012 16:23, Paolo Castagna wrote:
>     >>>> Hi,
>     >>>> I noticed that the SimplePageRankVertex implementation does not
>     deal with
>     >>>> dangling nodes (dangling nodes are those nodes with no outgoing
>     links).
>     >>>> And, probably that is one of the good reasons why there is a
>     "Simple" in front. ;-)
>     >>>>
>     >>>>  "Dangling nodes exist in many forms. For example, a page of
>     data, a page with a
>     >>>> postscript graph, [...], a page that has been fetched by a
>     crawler but not yet
>     >>>> explored - these are all examples of possible dangling nodes.
>     The more ambitious
>     >>>> the crawl, the bigger the proportion of dangling nodes because
>     the set of
>     fetched but uncrawled pages grows quickly." (page 80) [1]
>     >>>>
>     >>>> Wikipedia's PageRank page says:
>     >>>>
>     >>>>  "If a page has no links to other pages, it becomes a sink and
>     therefore
>     >>>> terminates the random surfing process. If the random surfer
>     arrives at a sink
>     >>>> page, it picks another URL at random and continues surfing
>     again. When
>     >>>> calculating PageRank, pages with no outbound links are assumed
>     to link out to
>     >>>> all other pages in the collection. Their PageRank scores are
>     therefore divided
>     >>>> evenly among all other pages. In other words, to be fair with
>     pages that are not
>     >>>> sinks, these random transitions are added to all nodes in the
>     Web, with a
>     >>>> residual probability usually set to d = 0.85, estimated from
>     the frequency that
>     >>>> an average surfer uses his or her browser's bookmark feature." [2]
>     >>>>
>     >>>> Now, if I want to try to account for the dangling nodes I need
>     to sum all
>     >>>> PageRank values for the dangling nodes and divide it evenly
>     among all other
>     >>>> pages (which in practice means to send a message to all vertexes
>     in Giraph...
>     >>>> which is not possible, as it would be crazy with large graphs).
>     >>>>
>     >>>> I imagine one can use a SumAggregator and alternate computing
>     contribution from
>     >>>> dangling nodes and PageRank values (i.e. PageRank values are
>     computed only every
>     >>>> 2 supersteps).
>     >>>>
>     >>>> What do you think?
>     >>>>
>     >>>> Paolo
>     >>>>
>     >>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and
>     Beyond: The
>     >>>> Science of Search Engine Rankings" (2006)
>     >>>>  [2] http://en.wikipedia.org/wiki/PageRank
>     >
> 
> 
