incubator-giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paolo Castagna <castagna.li...@googlemail.com>
Subject Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Date Fri, 18 May 2012 10:56:10 GMT
Hi Sebastian

Sebastian Schelter wrote:
> Very good points, could you add them to GIRAPH-191?

Done. Summarized with a link to this thread.

> I think that the convergence check does not need an extra superstep. It
> is sufficient to compute the L1-norm of the current pageRank vector,
> which can be done by an aggregator and compare it to the previous
> aggregated value.

Right. Much better. :-)

> I guess that summing up and redistributing the pagerank of dangling
> vertices can also be done without an extra superstep in an aggregator.

Yeah! Why didn't I think about that?
Thanks, great suggestion.

I am going to give this a go, at first without extending RandomWalkVertex since I want to
see how it might work.
This would also inform design and improvements of the current RandomWalkVertex.

Thanks for your comments and suggestions.

Paolo

> 
> --sebastian
> 
> On 18.05.2012 12:31, Paolo Castagna wrote:
>> Hi Gianmarco
>>
>> Gianmarco De Francisci Morales wrote:
>>> Could you just accumulate the PR of dangling nodes in a variable and
>>> divide it equally in the next iteration?
>> Yep, this is exactly what I was thinking when I wrote: "I imagine one can use a SumAggregator
and alternate computing contribution from dangling nodes and PageRank values (i.e. PageRank
values are
>> computed only every 2 supersteps).
>>
>> You need to use an Aggregator and spend a superstep just to aggregate/sum the PageRank
of all danling nodes.
>> Then, in the successive superstep you can use that divided by the number of nodes
in the graph to equally among all other nodes (i.e. the sumOfDanlingNodesRanks / getNumVertices()
below).
>>
>> Another, change/enhancement could be to check for convergence (to do that, you need
to keep current and previous PageRank values) and spend another superstep just doing that.
>> But, as you can see things stop being simple and the point of SimplePageRankVertex
is to offer a very simple example.
>> I did not started this thread with the aim to change SimplePageRankVertex, I think
it should stay as it is (and as the Google's Pregel paper describes it).
>>
>> However, in practice, if you want to run PageRank or similar, you need to deal with
problems such as: what to do with dangling nodes/pages? how to know how many iterations you
need to reach the
>> precision you want/need? Etc.
>>
>> My initial message was to exchange some ideas and get some feedback and/or suggestions
on how a not too simple PageRankVertex might look. ;-)
>>
>>> If you use a preference vector instead of dividing it equally, then you
>>> have also the strongly preferential (and weakly preferential in case you
>>> divide equally) versions of RWR.
>> Yep, but how you get your preference vector? ;-)
>>
>> As a first step, it would be great if we can come up with a PageRankVertex which
takes into account dangling nodes and perhaps/optionally check for convergence every few supersteps.
>>
>> Anyway, thanks for your comments and feedback Gianmarco.
>>
>> Paolo
>>
>>> Cheers,
>>> --
>>> Gianmarco
>>>
>>>
>>>
>>>
>>> On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna
>>> <castagna.lists@googlemail.com <mailto:castagna.lists@googlemail.com>>
>>> wrote:
>>>
>>>     Sebastian Schelter wrote:
>>>     > You are right, the code would throw an exception here, but I guess it
>>>     > would be sufficient to simply add a check here and not send any
>>>     messages.
>>>
>>>     That would avoid throwing an exception, but I am not sure it is
>>>     actually the best thing to do.
>>>     Some suggest to divide the PageRank scores of dangling nodes evenly
>>>     among all other pages.
>>>     If you want, this is equivalent to another random jump, but forced
>>>     by the fact that there are no outgoing links to select from.
>>>     Using your notation below:
>>>
>>>      0.15f / getNumVertices() + 0.85f ( sumOfNeighborRanks +
>>>     sumOfDanlingNodesRanks / getNumVertices() )
>>>
>>>     See also pag. 38 of the book ""Google's PageRank and Beyond":
>>>     http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
>>>     <http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38>
>>>
>>>     In other works, in pseudo code a serial implementation which
>>>     accounts for dangling nodes should be:
>>>
>>>     --------
>>>     dangling_nodes_ranks = 0
>>>     foreach node in dangling_node
>>>        dangling_nodes_ranks += ranks.get ( node )
>>>     dangling_nodes_ranks = ( dumping_factor * dangling_nodes_ranks ) / N ;
>>>
>>>     foeach node1 in graph
>>>        r = 0
>>>        foreach node2 in incoming_links ( node1 )
>>>            r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>>>
>>>     r = dumping_factor * r + dangling_nodes_ranks + ( 1 - dumping_factor
>>>     ) / N
>>>     --------
>>>
>>>     What do you think?
>>>
>>>     Paolo
>>>
>>>     >
>>>     > --sebastian
>>>     >
>>>     >
>>>     > On 17.05.2012 23:44, Paolo Castagna wrote:
>>>     >> Hi Sebastian,
>>>     >> from SimplePageRankVertex.java, we have:
>>>     >>
>>>     >>     if (getSuperstep() < MAX_SUPERSTEPS) {
>>>     >>       long edges = getNumOutEdges();
>>>     >>       sendMsgToAllEdges(
>>>     >>           new DoubleWritable(getVertexValue().get() / edges));
>>>     >>     } else {
>>>     >>       voteToHalt();
>>>     >>     }
>>>     >>
>>>     >> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling
>>>     node)?
>>>     >>
>>>     >> Paolo
>>>     >>
>>>     >> Sebastian Schelter wrote:
>>>     >>> The implementation implicitly deals with dangling nodes by
>>>     computing the
>>>     >>> pageRank as
>>>     >>>
>>>     >>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>>     >>>
>>>     >>> Conceptually, this is the same as connecting every vertex with
every
>>>     >>> other vertex it is not yet connected to. This removes all
>>>     dangling nodes
>>>     >>> from the graph as every vertex can reach every other vertex.
>>>     >>>
>>>     >>> "Mining of Massive Datasets" [1] has a very well written
>>>     explanation of
>>>     >>> this approach.
>>>     >>>
>>>     >>> --sebastian
>>>     >>>
>>>     >>>
>>>     >>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>>     >>>
>>>     >>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>>     >>>> Hi,
>>>     >>>> I noticed that the SimplePageRankVertex implementation does
not
>>>     deal with
>>>     >>>> dangling nodes (dangling nodes are those nodes with no outgoing
>>>     links).
>>>     >>>> And, probably that is one of the good reasons why there
is a
>>>     "Simple" in front. ;-)
>>>     >>>>
>>>     >>>>  "Dangling nodes exist in many forms. For example, a page
of
>>>     data, a page with a
>>>     >>>> postscript graph, [...], a page that has been fetched by
a
>>>     crawler but not yet
>>>     >>>> explored - these are all examples of possible dangling nodes.
>>>     The more ambitious
>>>     >>>> the crawl, the bigger the proportion of dangling nodes because
>>>     the set of
>>>     >>>> fetched but uncrawled pages grows quickly." (pag. 80) [1]
>>>     >>>>
>>>     >>>> Wikipedia's PageRank page says:
>>>     >>>>
>>>     >>>>  "If a page has no links to other pages, it becomes a sink
and
>>>     therefore
>>>     >>>> terminates the random surfing process. If the random surfer
>>>     arrives at a sink
>>>     >>>> page, it picks another URL at random and continues surfing
>>>     again. When
>>>     >>>> calculating PageRank, pages with no outbound links are assumed
>>>     to link out to
>>>     >>>> all other pages in the collection. Their PageRank scores
are
>>>     therefore divided
>>>     >>>> evenly among all other pages. In other words, to be fair
with
>>>     pages that are not
>>>     >>>> sinks, these random transitions are added to all nodes in
the
>>>     Web, with a
>>>     >>>> residual probability usually set to d = 0.85, estimated
from
>>>     the frequency that
>>>     >>>> an average surfer uses his or her browser's bookmark feature."
[2]
>>>     >>>>
>>>     >>>> Now, if I want to try to account for the dangling nodes
I need
>>>     to sum all
>>>     >>>> PageRank values for the dangling nodes and divide it evenly
>>>     among all other
>>>     >>>> pages (which in practce means to send a message to all vertexes
>>>     in Giraph...
>>>     >>>> which is not possible, as it would be crazy with large graphs).
>>>     >>>>
>>>     >>>> I imagine one can use a SumAggregator and alternate computing
>>>     contribution from
>>>     >>>> dangling nodes and PageRank values (i.e. PageRank values
are
>>>     computed only every
>>>     >>>> 2 supersteps).
>>>     >>>>
>>>     >>>> What do you think?
>>>     >>>>
>>>     >>>> Paolo
>>>     >>>>
>>>     >>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank
and
>>>     Beyond: The
>>>     >>>> Science of Search Engine Rankings" (2006)
>>>     >>>>  [2] http://en.wikipedia.org/wiki/PageRank
>>>     >
>>>
>>>
> 

Mime
View raw message