giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sebastian Schelter <...@apache.org>
Subject Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Date Fri, 18 May 2012 09:37:32 GMT
Accumulation should be possible with aggregators and the idea with
applying preference vectors seems very interesting too.

Maybe these should be implemented as additional features in GIRAPH-191?

On 18.05.2012 11:24, Gianmarco De Francisci Morales wrote:
> Could you just accumulate the PR of dangling nodes in a variable and divide
> it equally in the next iteration?
> If you use a preference vector instead of dividing it equally, then you
> have also the strongly preferential (and weakly preferential in case you
> divide equally) versions of RWR.
> 
> Cheers,
> --
> Gianmarco
> 
> 
> 
> 
> On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna <
> castagna.lists@googlemail.com> wrote:
> 
>> Sebastian Schelter wrote:
>>> You are right, the code would throw an exception here, but I guess it
>>> would be sufficient to simply add a check here and not send any messages.
>>
>> That would avoid throwing an exception, but I am not sure it is actually
>> the best thing to do.
>> Some suggest to divide the PageRank scores of dangling nodes evenly among
>> all other pages.
>> If you want, this is equivalent to another random jump, but forced by the
>> fact that there are no outgoing links to select from.
>> Using your notation below:
>>
>>  0.15f / getNumVertices() + 0.85f ( sumOfNeighborRanks +
>> sumOfDanlingNodesRanks / getNumVertices() )
>>
>> See also pag. 38 of the book ""Google's PageRank and Beyond":
>> http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
>>
>> In other works, in pseudo code a serial implementation which accounts for
>> dangling nodes should be:
>>
>> --------
>> dangling_nodes_ranks = 0
>> foreach node in dangling_node
>>    dangling_nodes_ranks += ranks.get ( node )
>> dangling_nodes_ranks = ( dumping_factor * dangling_nodes_ranks ) / N ;
>>
>> foeach node1 in graph
>>    r = 0
>>    foreach node2 in incoming_links ( node1 )
>>        r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>>
>> r = dumping_factor * r + dangling_nodes_ranks + ( 1 - dumping_factor ) / N
>> --------
>>
>> What do you think?
>>
>> Paolo
>>
>>>
>>> --sebastian
>>>
>>>
>>> On 17.05.2012 23:44, Paolo Castagna wrote:
>>>> Hi Sebastian,
>>>> from SimplePageRankVertex.java, we have:
>>>>
>>>>     if (getSuperstep() < MAX_SUPERSTEPS) {
>>>>       long edges = getNumOutEdges();
>>>>       sendMsgToAllEdges(
>>>>           new DoubleWritable(getVertexValue().get() / edges));
>>>>     } else {
>>>>       voteToHalt();
>>>>     }
>>>>
>>>> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
>>>>
>>>> Paolo
>>>>
>>>> Sebastian Schelter wrote:
>>>>> The implementation implicitly deals with dangling nodes by computing
>> the
>>>>> pageRank as
>>>>>
>>>>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>>>>
>>>>> Conceptually, this is the same as connecting every vertex with every
>>>>> other vertex it is not yet connected to. This removes all dangling
>> nodes
>>>>> from the graph as every vertex can reach every other vertex.
>>>>>
>>>>> "Mining of Massive Datasets" [1] has a very well written explanation
of
>>>>> this approach.
>>>>>
>>>>> --sebastian
>>>>>
>>>>>
>>>>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>>>>
>>>>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>>>>> Hi,
>>>>>> I noticed that the SimplePageRankVertex implementation does not deal
>> with
>>>>>> dangling nodes (dangling nodes are those nodes with no outgoing
>> links).
>>>>>> And, probably that is one of the good reasons why there is a "Simple"
>> in front. ;-)
>>>>>>
>>>>>>  "Dangling nodes exist in many forms. For example, a page of data,
a
>> page with a
>>>>>> postscript graph, [...], a page that has been fetched by a crawler
>> but not yet
>>>>>> explored - these are all examples of possible dangling nodes. The
>> more ambitious
>>>>>> the crawl, the bigger the proportion of dangling nodes because the
>> set of
>>>>>> fetched but uncrawled pages grows quickly." (pag. 80) [1]
>>>>>>
>>>>>> Wikipedia's PageRank page says:
>>>>>>
>>>>>>  "If a page has no links to other pages, it becomes a sink and
>> therefore
>>>>>> terminates the random surfing process. If the random surfer arrives
>> at a sink
>>>>>> page, it picks another URL at random and continues surfing again.
When
>>>>>> calculating PageRank, pages with no outbound links are assumed to
>> link out to
>>>>>> all other pages in the collection. Their PageRank scores are
>> therefore divided
>>>>>> evenly among all other pages. In other words, to be fair with pages
>> that are not
>>>>>> sinks, these random transitions are added to all nodes in the Web,
>> with a
>>>>>> residual probability usually set to d = 0.85, estimated from the
>> frequency that
>>>>>> an average surfer uses his or her browser's bookmark feature." [2]
>>>>>>
>>>>>> Now, if I want to try to account for the dangling nodes I need to
sum
>> all
>>>>>> PageRank values for the dangling nodes and divide it evenly among
all
>> other
>>>>>> pages (which in practce means to send a message to all vertexes in
>> Giraph...
>>>>>> which is not possible, as it would be crazy with large graphs).
>>>>>>
>>>>>> I imagine one can use a SumAggregator and alternate computing
>> contribution from
>>>>>> dangling nodes and PageRank values (i.e. PageRank values are computed
>> only every
>>>>>> 2 supersteps).
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Paolo
>>>>>>
>>>>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and
>> Beyond: The
>>>>>> Science of Search Engine Rankings" (2006)
>>>>>>  [2] http://en.wikipedia.org/wiki/PageRank
>>>
>>
> 


Mime
View raw message