Return-Path: X-Original-To: apmail-incubator-giraph-user-archive@minotaur.apache.org Delivered-To: apmail-incubator-giraph-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 77E84CA9D for ; Thu, 17 May 2012 21:44:56 +0000 (UTC) Received: (qmail 23091 invoked by uid 500); 17 May 2012 21:44:56 -0000 Delivered-To: apmail-incubator-giraph-user-archive@incubator.apache.org Received: (qmail 22940 invoked by uid 500); 17 May 2012 21:44:56 -0000 Mailing-List: contact giraph-user-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: giraph-user@incubator.apache.org Delivered-To: mailing list giraph-user@incubator.apache.org Received: (qmail 22926 invoked by uid 99); 17 May 2012 21:44:56 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 17 May 2012 21:44:56 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of castagna.lists@googlemail.com designates 209.85.212.177 as permitted sender) Received: from [209.85.212.177] (HELO mail-wi0-f177.google.com) (209.85.212.177) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 17 May 2012 21:44:51 +0000 Received: by wibhm14 with SMTP id hm14so1844240wib.0 for ; Thu, 17 May 2012 14:44:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:subject:references :in-reply-to:content-type:content-transfer-encoding; bh=OzfW6L9KLthHJCMyYsF8L7zTylgs4uicWNcVbTnK2Ns=; b=G84DBdvg9Nb3hS/XW24HYuDWBylIAPL90anpihJlgifls+XvqkRCq5liS02tISsBcR ZspJSSJgQO/L4lIbObhOY3wTvFkvW/ZixCyQ1EjhhSciG8UF5AtekakNrB2YQLrlV5wl sx9bt8YjVPzzMowYW2TQo0/7zSOTX93Dh1vMZ4eoNN4oOht4/YESFQ2ycr+SMlm6AyQU x2HK9uKB5s9vh3QWTS4ChtckgYwUmYH2hP4iyJND4pqMKnVrrM+JnQqbga/DBdNsHHTy sgUI8sakEjx6GJUn64CgwGDjhMhfsrUl4yk3L+3cU26erX6wNTdMyuqwinYOG1vZLjzy b+BQ== Received: by 10.180.7.133 with SMTP id j5mr21913367wia.14.1337291069903; Thu, 17 May 2012 14:44:29 -0700 (PDT) Received: from [192.168.2.10] (80-42-205-238.dynamic.dsl.as9105.com. [80.42.205.238]) by mx.google.com with ESMTPS id j4sm45583531wiz.1.2012.05.17.14.44.28 (version=TLSv1/SSLv3 cipher=OTHER); Thu, 17 May 2012 14:44:29 -0700 (PDT) Message-ID: <4FB5713B.2080504@googlemail.com> Date: Thu, 17 May 2012 22:44:27 +0100 From: Paolo Castagna User-Agent: Thunderbird 2.0.0.24 (X11/20101027) MIME-Version: 1.0 To: giraph-user@incubator.apache.org Subject: Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes... References: <4FB509F4.4040407@googlemail.com> <4FB52A7A.7030601@apache.org> In-Reply-To: <4FB52A7A.7030601@apache.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org Hi Sebastian, from SimplePageRankVertex.java, we have: if (getSuperstep() < MAX_SUPERSTEPS) { long edges = getNumOutEdges(); sendMsgToAllEdges( new DoubleWritable(getVertexValue().get() / edges)); } else { voteToHalt(); } What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)? Paolo Sebastian Schelter wrote: > The implementation implicitly deals with dangling nodes by computing the > pageRank as > > 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks > > Conceptually, this is the same as connecting every vertex with every > other vertex it is not yet connected to. This removes all dangling nodes > from the graph as every vertex can reach every other vertex. > > "Mining of Massive Datasets" [1] has a very well written explanation of > this approach. > > --sebastian > > > [1] http://infolab.stanford.edu/~ullman/mmds.html > > On 17.05.2012 16:23, Paolo Castagna wrote: >> Hi, >> I noticed that the SimplePageRankVertex implementation does not deal with >> dangling nodes (dangling nodes are those nodes with no outgoing links). >> And, probably that is one of the good reasons why there is a "Simple" in front. ;-) >> >> "Dangling nodes exist in many forms. For example, a page of data, a page with a >> postscript graph, [...], a page that has been fetched by a crawler but not yet >> explored - these are all examples of possible dangling nodes. The more ambitious >> the crawl, the bigger the proportion of dangling nodes because the set of >> fetched but uncrawled pages grows quickly." (pag. 80) [1] >> >> Wikipedia's PageRank page says: >> >> "If a page has no links to other pages, it becomes a sink and therefore >> terminates the random surfing process. If the random surfer arrives at a sink >> page, it picks another URL at random and continues surfing again. When >> calculating PageRank, pages with no outbound links are assumed to link out to >> all other pages in the collection. Their PageRank scores are therefore divided >> evenly among all other pages. In other words, to be fair with pages that are not >> sinks, these random transitions are added to all nodes in the Web, with a >> residual probability usually set to d = 0.85, estimated from the frequency that >> an average surfer uses his or her browser's bookmark feature." [2] >> >> Now, if I want to try to account for the dangling nodes I need to sum all >> PageRank values for the dangling nodes and divide it evenly among all other >> pages (which in practce means to send a message to all vertexes in Giraph... >> which is not possible, as it would be crazy with large graphs). >> >> I imagine one can use a SumAggregator and alternate computing contribution from >> dangling nodes and PageRank values (i.e. PageRank values are computed only every >> 2 supersteps). >> >> What do you think? >> >> Paolo >> >> [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The >> Science of Search Engine Rankings" (2006) >> [2] http://en.wikipedia.org/wiki/PageRank >