Benjamin Heitmann wrote:
> On 21 May 2012, at 17:15, Paolo Castagna wrote:
>> A more direct question would be: are Giraph users supposed to extend
>> BasicVertex<I,V,E,M> when they do not find a subclass of BasicVertex which meets
>> their needs?
>
> No, they are not. However, this is not explicitly documented anywhere.
>
> Or, to say it more clearly: if you look at the javadoc of BasicVertex and if you search the mailing lists,
> you will find that users are discouraged from using/extending BasicVertex, but you will not find any suggestion of which
> Vertex to extend instead. I even asked basically the same question once, and got very indirect answers.
> But that is okay, I figured it out by trial and error ;)
>
> Users are supposed to extend HashMapVertex or EdgeListVertex (both in org.apache.giraph.graph).

Hi Benjamin,
right, I should have seen those (this is a good sign I should stop for today and continue tomorrow morning).

I didn't because I was thinking: "I do not need my vertexes to be mutable" (since computing PageRank does not need to change the topology of a graph), so I was not focusing my attention on the MutableVertex hierarchy of classes (my mistake).

Now my question would be: why does SimplePageRankVertex extend LongDoubleFloatDoubleVertex? (But, I'll look at this tomorrow.)

> If you try to extend them, you will see that you can basically plug in any kind of existing class in the <I,V,E,M> signature,
> as long as they implement the right interfaces, which are all Writable (and WritableComparable for I).
> Then you just need to add your compute() method, and you are ready to go.
> Both HashMapVertex and EdgeListVertex provide implementations of all the housekeeping that Giraph needs.
>
> Trying to work directly by extending BasicVertex will not work, as a lot of methods are only accessible at the package level.

Yep.
> It would probably be a good idea to submit a small javadoc patch which adds documentation to BasicVertex, telling users to look at those other two classes.

Yep. Thanks again for your help and for pointing me in the right direction.

Paolo

On 21 May 2012, at 17:15, Paolo Castagna wrote:
> A more direct question would be: are Giraph users supposed to extend
> BasicVertex<I,V,E,M> when they do not find a subclass of BasicVertex which meets
> their needs?

No, they are not. However, this is not explicitly documented anywhere.

Or, to say it more clearly: if you look at the javadoc of BasicVertex and if you search the mailing lists, you will find that users are discouraged from using/extending BasicVertex, but you will not find any suggestion of which Vertex to extend instead. I even asked basically the same question once, and got very indirect answers. But that is okay, I figured it out by trial and error ;)

Users are supposed to extend HashMapVertex or EdgeListVertex (both in org.apache.giraph.graph). If you try to extend them, you will see that you can basically plug in any kind of existing class in the <I,V,E,M> signature, as long as they implement the right interfaces, which are all Writable (and WritableComparable for I). Then you just need to add your compute() method, and you are ready to go. Both HashMapVertex and EdgeListVertex provide implementations of all the housekeeping that Giraph needs.

Trying to work directly by extending BasicVertex will not work, as a lot of methods are only accessible at the package level.

It would probably be a good idea to submit a small javadoc patch which adds documentation to BasicVertex, telling users to look at those other two classes.

Hi,
I am trying to write a different PageRank implementation which copes with dangling vertexes (i.e. vertexes with no outgoing edges) and with mistakes in the data (such as self-links and/or repetitions). For more on dangling vertexes, see also GIRAPH-191 [1].

My test input data is something similar to:

1 1 2
2 2 3 5 7 9 11
3 3 3 6
...

This is an adjacency list with no edge values and no vertex values: just a vertex id followed by its target vertex ids.

The current SimplePageRankVertex extends LongDoubleFloatDoubleVertex, which uses Float for edge values. But those values are unused. I could use that with 'fake' values for the edges, but I would also like to have generic String (i.e. Text) vertex ids.

So, I tried to write my own TextDoubleNullDoubleVertex which extends MutableVertex and/or BasicVertex. This is probably a mistake, right? Or not? The problem I have with this approach is that the BasicVertex methods putMessages(...) and releaseResources() are package private and my PageRankVertex is, obviously, not in the org.apache.giraph.graph package. Indeed, those methods seem to me more 'internal' stuff than something Giraph users should be aware of (so, I understand why they are package private).

A more direct question would be: are Giraph users supposed to extend BasicVertex<I,V,E,M> when they do not find a subclass of BasicVertex which meets their needs?

Cheers,
Paolo

[1] https://issues.apache.org/jira/browse/GIRAPH-191
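The input cleanup described above (dropping self-links and repeated targets) can be sketched serially in plain Python; this is not Giraph code, and the function name is illustrative:

```python
def parse_adjacency_line(line):
    """Parse one adjacency-list line of the form 'source target1 target2 ...'.

    Cleans the data mistakes mentioned above: self-links and
    repeated target ids are dropped, preserving target order.
    """
    ids = line.split()
    source, targets = ids[0], ids[1:]
    seen = set()
    cleaned = []
    for t in targets:
        if t == source or t in seen:  # drop self-links and repetitions
            continue
        seen.add(t)
        cleaned.append(t)
    return source, cleaned
```

For the sample data above, the line "3 3 3 6" yields vertex "3" with the single target "6": the self-link and the repetition are removed before the graph is built.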

Hi Sebastian

Sebastian Schelter wrote:
> Very good points, could you add them to GIRAPH-191?

Done. Summarized with a link to this thread.

> I think that the convergence check does not need an extra superstep. It
> is sufficient to compute the L1-norm of the current pageRank vector,
> which can be done by an aggregator and compare it to the previous
> aggregated value.

Right. Much better. :-)

> I guess that summing up and redistributing the pagerank of dangling
> vertices can also be done without an extra superstep in an aggregator.

Yeah! Why didn't I think about that? Thanks, great suggestion. I am going to give this a go, at first without extending RandomWalkVertex, since I want to see how it might work. This would also inform the design and improvements of the current RandomWalkVertex.

Thanks for your comments and suggestions.

Paolo

>
> --sebastian
>
> On 18.05.2012 12:31, Paolo Castagna wrote:
>> Hi Gianmarco
>>
>> Gianmarco De Francisci Morales wrote:
>>> Could you just accumulate the PR of dangling nodes in a variable and
>>> divide it equally in the next iteration?
>> Yep, this is exactly what I was thinking when I wrote: "I imagine one can use a SumAggregator and alternate computing contribution from dangling nodes and PageRank values (i.e. PageRank values are computed only every 2 supersteps)."
>>
>> You need to use an Aggregator and spend a superstep just to aggregate/sum the PageRank of all dangling nodes.
>> Then, in the successive superstep you can use that, divided by the number of nodes in the graph, to distribute it equally among all other nodes (i.e. the sumOfDanglingNodesRanks / getNumVertices() below).
>>
>> Another change/enhancement could be to check for convergence (to do that, you need to keep current and previous PageRank values) and spend another superstep just doing that.
>> But, as you can see, things stop being simple, and the point of SimplePageRankVertex is to offer a very simple example.
>> I did not start this thread with the aim to change SimplePageRankVertex; I think it should stay as it is (and as Google's Pregel paper describes it).
>>
>> However, in practice, if you want to run PageRank or similar, you need to deal with problems such as: what to do with dangling nodes/pages? How to know how many iterations you need to reach the precision you want/need? Etc.
>>
>> My initial message was to exchange some ideas and get some feedback and/or suggestions on how a not too simple PageRankVertex might look. ;-)
>>
>>> If you use a preference vector instead of dividing it equally, then you
>>> have also the strongly preferential (and weakly preferential in case you
>>> divide equally) versions of RWR.
>> Yep, but how do you get your preference vector? ;-)
>>
>> As a first step, it would be great if we can come up with a PageRankVertex which takes into account dangling nodes and perhaps/optionally checks for convergence every few supersteps.
>>
>> Anyway, thanks for your comments and feedback Gianmarco.
>>
>> Paolo
>>
>>> Cheers,
>>> --
>>> Gianmarco
>>>
>>> On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna
>>> <castagna.lists@googlemail.com <mailto:castagna.lists@googlemail.com>> wrote:
>>>
>>> Sebastian Schelter wrote:
>>> > You are right, the code would throw an exception here, but I guess it
>>> > would be sufficient to simply add a check here and not send any messages.
>>>
>>> That would avoid throwing an exception, but I am not sure it is actually the best thing to do.
>>> Some suggest to divide the PageRank scores of dangling nodes evenly among all other pages.
>>> If you want, this is equivalent to another random jump, but forced by the fact that there are no outgoing links to select from.
>>> Using your notation below:
>>>
>>> 0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks +
>>> sumOfDanglingNodesRanks / getNumVertices() )
>>>
>>> See also pag.
38 of the book "Google's PageRank and Beyond":
>>> http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
>>>
>>> In other words, in pseudo code a serial implementation which accounts for dangling nodes should be:
>>>
>>> --------
>>> dangling_nodes_ranks = 0
>>> foreach node in dangling_nodes
>>> dangling_nodes_ranks += ranks.get ( node )
>>> dangling_nodes_ranks = ( damping_factor * dangling_nodes_ranks ) / N
>>>
>>> foreach node1 in graph
>>> r = 0
>>> foreach node2 in incoming_links ( node1 )
>>> r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>>>
>>> r = damping_factor * r + dangling_nodes_ranks + ( 1 - damping_factor ) / N
>>> --------
>>>
>>> What do you think?
>>>
>>> Paolo
>>>
>>> > --sebastian
>>> >
>>> > On 17.05.2012 23:44, Paolo Castagna wrote:
>>> >> Hi Sebastian,
>>> >> from SimplePageRankVertex.java, we have:
>>> >>
>>> >> if (getSuperstep() < MAX_SUPERSTEPS) {
>>> >>   long edges = getNumOutEdges();
>>> >>   sendMsgToAllEdges(
>>> >>     new DoubleWritable(getVertexValue().get() / edges));
>>> >> } else {
>>> >>   voteToHalt();
>>> >> }
>>> >>
>>> >> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
>>> >>
>>> >> Paolo
>>> >>
>>> >> Sebastian Schelter wrote:
>>> >>> The implementation implicitly deals with dangling nodes by computing the
>>> >>> pageRank as
>>> >>>
>>> >>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>> >>>
>>> >>> Conceptually, this is the same as connecting every vertex with every
>>> >>> other vertex it is not yet connected to. This removes all dangling nodes
>>> >>> from the graph as every vertex can reach every other vertex.
>>> >>>
>>> >>> "Mining of Massive Datasets" [1] has a very well written explanation of
>>> >>> this approach.
>>> >>> >>> >>> --sebastian >>> >>> >>> >>> >>> >>> [1] http://infolab.stanford.edu/~ullman/mmds.html >>> >>> >>> >>> On 17.05.2012 16:23, Paolo Castagna wrote: >>> >>>> Hi, >>> >>>> I noticed that the SimplePageRankVertex implementation does not >>> deal with >>> >>>> dangling nodes (dangling nodes are those nodes with no outgoing >>> links). >>> >>>> And, probably that is one of the good reasons why there is a >>> "Simple" in front. ;-) >>> >>>> >>> >>>> "Dangling nodes exist in many forms. For example, a page of >>> data, a page with a >>> >>>> postscript graph, [...], a page that has been fetched by a >>> crawler but not yet >>> >>>> explored - these are all examples of possible dangling nodes. >>> The more ambitious >>> >>>> the crawl, the bigger the proportion of dangling nodes because >>> the set of >>> >>>> fetched but uncrawled pages grows quickly." (pag. 80) [1] >>> >>>> >>> >>>> Wikipedia's PageRank page says: >>> >>>> >>> >>>> "If a page has no links to other pages, it becomes a sink and >>> therefore >>> >>>> terminates the random surfing process. If the random surfer >>> arrives at a sink >>> >>>> page, it picks another URL at random and continues surfing >>> again. When >>> >>>> calculating PageRank, pages with no outbound links are assumed >>> to link out to >>> >>>> all other pages in the collection. Their PageRank scores are >>> therefore divided >>> >>>> evenly among all other pages. In other words, to be fair with >>> pages that are not >>> >>>> sinks, these random transitions are added to all nodes in the >>> Web, with a >>> >>>> residual probability usually set to d = 0.85, estimated from >>> the frequency that >>> >>>> an average surfer uses his or her browser's bookmark feature." 
[2]
>>> >>>>
>>> >>>> Now, if I want to try to account for the dangling nodes I need to sum all
>>> >>>> PageRank values for the dangling nodes and divide it evenly among all other
>>> >>>> pages (which in practice means to send a message to all vertexes in Giraph...
>>> >>>> which is not possible, as it would be crazy with large graphs).
>>> >>>>
>>> >>>> I imagine one can use a SumAggregator and alternate computing contribution from
>>> >>>> dangling nodes and PageRank values (i.e. PageRank values are computed only every
>>> >>>> 2 supersteps).
>>> >>>>
>>> >>>> What do you think?
>>> >>>>
>>> >>>> Paolo
>>> >>>>
>>> >>>> [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
>>> >>>> Science of Search Engine Rankings" (2006)
>>> >>>> [2] http://en.wikipedia.org/wiki/PageRank
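The serial pseudocode quoted in this thread can be turned into a small runnable sketch. This is plain Python, outside Giraph; the function name and the dict-based graph representation are illustrative assumptions, not anything from the Giraph API:

```python
def pagerank(graph, damping=0.85, iterations=30):
    """Serial PageRank that redistributes dangling-node mass evenly.

    graph: dict mapping each node to its list of outgoing neighbours.
    Each iteration, the rank mass sitting on dangling nodes (empty
    out-list) is spread evenly over all nodes, alongside the usual
    random-jump term, as in the thread's pseudocode.
    """
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Sum the rank currently held by dangling nodes.
        dangling = sum(ranks[v] for v in graph if not graph[v])
        dangling_share = damping * dangling / n
        # Random-jump term plus the redistributed dangling mass.
        new_ranks = {v: (1.0 - damping) / n + dangling_share for v in graph}
        # Usual contribution along out-edges.
        for v, outs in graph.items():
            if outs:
                share = damping * ranks[v] / len(outs)
                for w in outs:
                    new_ranks[w] += share
        ranks = new_ranks
    return ranks
```

With this correction the ranks form a proper probability distribution (they sum to 1) even when the graph contains dangling nodes, which is exactly what the plain 0.15/N + 0.85 * sumOfNeighborRanks formula loses.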

Hi Paolo,

Very good points, could you add them to GIRAPH-191?

I think that the convergence check does not need an extra superstep. It is sufficient to compute the L1-norm of the current pageRank vector, which can be done by an aggregator, and compare it to the previous aggregated value.

I guess that summing up and redistributing the pagerank of dangling vertices can also be done without an extra superstep in an aggregator.

--sebastian

On 18.05.2012 12:31, Paolo Castagna wrote:
> Hi Gianmarco
>
> Gianmarco De Francisci Morales wrote:
>> Could you just accumulate the PR of dangling nodes in a variable and
>> divide it equally in the next iteration?
>
> Yep, this is exactly what I was thinking when I wrote: "I imagine one can use a SumAggregator and alternate computing contribution from dangling nodes and PageRank values (i.e. PageRank values are computed only every 2 supersteps)."
>
> You need to use an Aggregator and spend a superstep just to aggregate/sum the PageRank of all dangling nodes.
> Then, in the successive superstep you can use that, divided by the number of nodes in the graph, to distribute it equally among all other nodes (i.e. the sumOfDanglingNodesRanks / getNumVertices() below).
>
> Another change/enhancement could be to check for convergence (to do that, you need to keep current and previous PageRank values) and spend another superstep just doing that.
> But, as you can see, things stop being simple, and the point of SimplePageRankVertex is to offer a very simple example.
> I did not start this thread with the aim to change SimplePageRankVertex; I think it should stay as it is (and as Google's Pregel paper describes it).
>
> However, in practice, if you want to run PageRank or similar, you need to deal with problems such as: what to do with dangling nodes/pages? How to know how many iterations you need to reach the precision you want/need? Etc.
> My initial message was to exchange some ideas and get some feedback and/or suggestions on how a not too simple PageRankVertex might look. ;-)
>
>> If you use a preference vector instead of dividing it equally, then you
>> have also the strongly preferential (and weakly preferential in case you
>> divide equally) versions of RWR.
>
> Yep, but how do you get your preference vector? ;-)
>
> As a first step, it would be great if we can come up with a PageRankVertex which takes into account dangling nodes and perhaps/optionally checks for convergence every few supersteps.
>
> Anyway, thanks for your comments and feedback Gianmarco.
>
> Paolo
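Sebastian's aggregator-based convergence check can be illustrated serially: compare the L1 norm of successive rank vectors and stop once the change drops below a tolerance. A minimal sketch in plain Python, assuming rank vectors are dicts keyed by vertex id (in Giraph the same sum would be accumulated in a sum aggregator during the superstep, so no extra superstep is needed):

```python
def l1_delta(old_ranks, new_ranks):
    """L1 distance between two rank vectors over the same vertex set."""
    return sum(abs(new_ranks[v] - old_ranks[v]) for v in old_ranks)


def has_converged(old_ranks, new_ranks, tolerance=1e-6):
    # Each vertex contributes |new - old| for its own rank; the driver
    # compares the aggregated total against the tolerance.
    return l1_delta(old_ranks, new_ranks) < tolerance
```

The loop then keeps the previous vector around, updates ranks, and exits as soon as `has_converged` is true instead of running a fixed MAX_SUPERSTEPS.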

Hi Sebastian

Sebastian Schelter wrote:
> Accumulation should be possible with aggregators and the idea of
> applying preference vectors seems very interesting too.

Indeed. But...

- How big is the preference vector? (It's N elements, where N is the number of vertexes in your graph, right?)
- How do we load that into the Aggregator? Can we do it just once at the beginning?
- How can a user determine the preference vector? (This is not something we need to be too worried about; it's a user-land problem...)
- ...

So, I think dealing properly with dangling nodes is a much more important first step.

> Maybe these should be implemented as additional features in GIRAPH-191?

Yeah, it would be good to have these capabilities in a RandomWalkVertex (as you showed) and maybe use that in a 'proper' PageRankVertex implementation (more configurable, and which deals with dangling nodes and/or iterates until convergence is reached...).

Paolo
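The preference-vector idea discussed above (random walk with restart) can also be sketched serially: instead of spreading the random-jump and dangling mass uniformly as 1/N, spread it according to a preference distribution. This is a plain-Python sketch of the "strongly preferential" variant Gianmarco mentions, where both the restart and the dangling mass follow the preference vector; a uniform vector recovers ordinary PageRank. Names are illustrative:

```python
def personalized_pagerank(graph, preference, damping=0.85, iterations=30):
    """PageRank with a preference vector (random walk with restart).

    graph: dict node -> list of outgoing neighbours.
    preference: dict node -> restart probability, summing to 1.
    Both the restart mass and the dangling mass are distributed
    according to the preference vector ('strongly preferential');
    distributing them uniformly instead gives the 'weakly
    preferential' variant.
    """
    ranks = {v: preference[v] for v in graph}
    for _ in range(iterations):
        dangling = sum(ranks[v] for v in graph if not graph[v])
        # Restart mass (1 - d) and dangling mass (d * dangling),
        # both routed through the preference vector.
        new_ranks = {
            v: (1.0 - damping + damping * dangling) * preference[v]
            for v in graph
        }
        for v, outs in graph.items():
            if outs:
                share = damping * ranks[v] / len(outs)
                for w in outs:
                    new_ranks[w] += share
        ranks = new_ranks
    return ranks
```

This makes concrete why the preference vector is awkward in Giraph: it is N entries, one per vertex, so it is naturally stored as part of the vertex value rather than shipped through an aggregator.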

Hi Gianmarco

Gianmarco De Francisci Morales wrote:
> Could you just accumulate the PR of dangling nodes in a variable and
> divide it equally in the next iteration?

Yep, this is exactly what I was thinking when I wrote: "I imagine one can use a SumAggregator and alternate computing contribution from dangling nodes and PageRank values (i.e. PageRank values are computed only every 2 supersteps)."

You need to use an Aggregator and spend a superstep just to aggregate/sum the PageRank of all dangling nodes. Then, in the successive superstep you can use that, divided by the number of nodes in the graph, to distribute it equally among all other nodes (i.e. the sumOfDanglingNodesRanks / getNumVertices() below).

Another change/enhancement could be to check for convergence (to do that, you need to keep current and previous PageRank values) and spend another superstep just doing that. But, as you can see, things stop being simple, and the point of SimplePageRankVertex is to offer a very simple example. I did not start this thread with the aim to change SimplePageRankVertex; I think it should stay as it is (and as Google's Pregel paper describes it).

However, in practice, if you want to run PageRank or similar, you need to deal with problems such as: what to do with dangling nodes/pages? How to know how many iterations you need to reach the precision you want/need? Etc.

My initial message was to exchange some ideas and get some feedback and/or suggestions on how a not too simple PageRankVertex might look. ;-)

> If you use a preference vector instead of dividing it equally, then you
> have also the strongly preferential (and weakly preferential in case you
> divide equally) versions of RWR.

Yep, but how do you get your preference vector? ;-)

As a first step, it would be great if we can come up with a PageRankVertex which takes into account dangling nodes and perhaps/optionally checks for convergence every few supersteps.

Anyway, thanks for your comments and feedback Gianmarco.
Paolo

Accumulation should be possible with aggregators, and the idea of applying preference vectors seems very interesting too. Maybe these should be implemented as additional features in GIRAPH-191?

On 18.05.2012 11:24, Gianmarco De Francisci Morales wrote:
> Could you just accumulate the PR of dangling nodes in a variable and divide
> it equally in the next iteration?
> If you use a preference vector instead of dividing it equally, then you
> have also the strongly preferential (and weakly preferential, in case you
> divide equally) versions of RWR.
>
> Cheers,
> --
> Gianmarco

Could you just accumulate the PR of dangling nodes in a variable and divide it equally in the next iteration?
If you use a preference vector instead of dividing it equally, then you have also the strongly preferential (and weakly preferential, in case you divide equally) versions of RWR.

Cheers,
--
Gianmarco

On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna <castagna.lists@googlemail.com> wrote:

Sebastian Schelter wrote:
> You are right, the code would throw an exception here, but I guess it
> would be sufficient to simply add a check here and not send any messages.

That would avoid throwing an exception, but I am not sure it is actually the best thing to do.
Some suggest to divide the PageRank scores of dangling nodes evenly among all other pages.
If you want, this is equivalent to another random jump, but forced by the fact that there are no outgoing links to select from.
Using your notation below:

0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks + sumOfDanglingNodesRanks / getNumVertices() )

See also page 38 of the book "Google's PageRank and Beyond":
http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38

In other words, in pseudocode, a serial implementation which accounts for dangling nodes would be:

--------
dangling_nodes_rank = 0
foreach node in dangling_nodes
    dangling_nodes_rank += ranks.get ( node )
dangling_nodes_rank = ( damping_factor * dangling_nodes_rank ) / N

foreach node1 in graph
    r = 0
    foreach node2 in incoming_links ( node1 )
        r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
    r = damping_factor * r + dangling_nodes_rank + ( 1 - damping_factor ) / N
--------

What do you think?

Paolo
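The serial pseudocode above can be sketched as a small runnable Java method. This is purely illustrative (the class and method names are ours, not from Giraph or the thread); it assumes the graph is given as an out-link map and that the ranks sum to 1 on entry:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One serial PageRank iteration that also redistributes the rank mass
// held by dangling nodes evenly over all vertices, as in the pseudocode
// above. Illustrative sketch only; names are ours, not Giraph's.
class DanglingPageRank {
    static Map<Integer, Double> iterate(Map<Integer, List<Integer>> outLinks,
                                        Map<Integer, Double> ranks,
                                        double dampingFactor) {
        int n = outLinks.size();
        // Sum the rank currently sitting on dangling nodes (no out-links).
        double danglingRank = 0.0;
        for (Map.Entry<Integer, List<Integer>> e : outLinks.entrySet()) {
            if (e.getValue().isEmpty()) {
                danglingRank += ranks.get(e.getKey());
            }
        }
        // Every vertex gets the random-jump share plus an even slice of
        // the damped dangling mass.
        double base = (1.0 - dampingFactor) / n + dampingFactor * danglingRank / n;
        Map<Integer, Double> next = new HashMap<>();
        for (Integer v : outLinks.keySet()) {
            next.put(v, base);
        }
        // Non-dangling vertices spread their damped rank over their out-links.
        for (Map.Entry<Integer, List<Integer>> e : outLinks.entrySet()) {
            List<Integer> out = e.getValue();
            if (out.isEmpty()) continue;
            double share = dampingFactor * ranks.get(e.getKey()) / out.size();
            for (Integer dst : out) {
                next.merge(dst, share, Double::sum);
            }
        }
        return next;
    }
}
```

With this update the total rank stays at 1.0 across iterations, which is an easy sanity check for any implementation of the scheme.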

You are right, the code would throw an exception here, but I guess it would be sufficient to simply add a check here and not send any messages.

--sebastian

On 17.05.2012 23:44, Paolo Castagna wrote:
> Hi Sebastian,
> from SimplePageRankVertex.java, we have:
>
> if (getSuperstep() < MAX_SUPERSTEPS) {
>   long edges = getNumOutEdges();
>   sendMsgToAllEdges(
>       new DoubleWritable(getVertexValue().get() / edges));
> } else {
>   voteToHalt();
> }
>
> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
>
> Paolo
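Guarding the `sendMsgToAllEdges` call does avoid the exception, but it silently drops the rank held by dangling vertices. The following sketch (our own illustrative code, not Giraph API) shows that with the `(1-d)/N + d * sumOfIncoming` update and suppressed dangling messages, the total rank mass shrinks over iterations:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Demonstrates why "just don't send messages from dangling nodes" leaks
// rank mass: whatever rank a dangling vertex holds is discarded each
// iteration instead of being redistributed. Illustrative names only.
class RankLeakDemo {
    static double totalAfter(Map<Integer, List<Integer>> outLinks, int iters, double d) {
        int n = outLinks.size();
        Map<Integer, Double> ranks = new HashMap<>();
        for (Integer v : outLinks.keySet()) ranks.put(v, 1.0 / n);
        for (int i = 0; i < iters; i++) {
            Map<Integer, Double> next = new HashMap<>();
            for (Integer v : outLinks.keySet()) next.put(v, (1.0 - d) / n);
            for (Map.Entry<Integer, List<Integer>> e : outLinks.entrySet()) {
                // The "check": a dangling vertex sends no messages at all.
                if (e.getValue().isEmpty()) continue;
                double share = d * ranks.get(e.getKey()) / e.getValue().size();
                for (Integer dst : e.getValue()) next.merge(dst, share, Double::sum);
            }
            ranks = next;
        }
        double total = 0.0;
        for (double v : ranks.values()) total += v;
        return total;
    }
}
```

This is why the thread converges on redistributing the dangling mass (via an aggregator) rather than merely skipping the send.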

Hi Sebastian,
from SimplePageRankVertex.java, we have:

if (getSuperstep() < MAX_SUPERSTEPS) {
  long edges = getNumOutEdges();
  sendMsgToAllEdges(
      new DoubleWritable(getVertexValue().get() / edges));
} else {
  voteToHalt();
}

What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?

Paolo

Great! Created GIRAPH-191.

Cheers,
--
Gianmarco

On Thu, May 17, 2012 at 11:02 PM, Sebastian Schelter <ssc@apache.org> wrote:

Hi,

you are completely right. I also started implementing RWR today, coincidentally; could you file a JIRA ticket for RWR? I would attach my work done so far and we could work a little on the code.

I think that a lot of things need improvement in the PageRank/RWR implementation, e.g. stuff like the teleportation probability should be configurable. Furthermore, you shouldn't have to specify the number of supersteps that need to be executed; instead, convergence should be checked somehow via an aggregator.

Best,
Sebastian

On 17.05.2012 22:58, Gianmarco De Francisci Morales wrote:

Hi Giraphers,

I am implementing a Random Walk with Restart on Giraph.
As far as I have understood, the only thing needed would be to modify PageRank in order to take into account the preference vector. This means all random jumps get back to the source of the RWR.
In practice, in org/apache/giraph/examples/SimplePageRankVertex.java the new vertex value is computed as:

DoubleWritable vertexValue = new DoubleWritable((0.15f / getNumVertices()) + 0.85f * sum);

And the only thing I should do to implement the RWR is:

if (myID == sourceID)
  DoubleWritable vertexValue = new DoubleWritable(0.15f + 0.85f * sum);
else
  DoubleWritable vertexValue = new DoubleWritable(0.85f * sum);

because all the random jumps converge on the single source.
Am I correct or am I missing something?

Cheers,
--
Gianmarco
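As a sanity check of the update rule described above, here is a hedged serial sketch (our own illustrative names, not the Giraph API): the whole teleport mass goes back to the single source, so rank should concentrate around the source while the total still sums to 1. It assumes no dangling nodes, so every vertex has at least one out-link:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One serial iteration of Random Walk with Restart: like PageRank, but
// the teleport mass (1 - d) is given entirely to the source vertex
// instead of being spread over all vertices. Illustrative sketch only.
class RandomWalkWithRestart {
    static Map<Integer, Double> iterate(Map<Integer, List<Integer>> outLinks,
                                        Map<Integer, Double> ranks,
                                        int source, double damping) {
        Map<Integer, Double> next = new HashMap<>();
        for (Integer v : outLinks.keySet()) {
            next.put(v, 0.0);
        }
        // Every vertex spreads damping * rank over its out-links
        // (assumes no dangling nodes, i.e. every list is non-empty).
        for (Map.Entry<Integer, List<Integer>> e : outLinks.entrySet()) {
            double share = damping * ranks.get(e.getKey()) / e.getValue().size();
            for (Integer dst : e.getValue()) {
                next.merge(dst, share, Double::sum);
            }
        }
        // All random jumps land back on the source.
        next.merge(source, 1.0 - damping, Double::sum);
        return next;
    }
}
```

On a small cycle this converges to a distribution where the source holds the most rank and mass decays with distance from it, which matches the intuition behind the if/else update in the message above.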

The implementation implicitly deals with dangling nodes by computing the pageRank as

0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks

Conceptually, this is the same as connecting every vertex with every other vertex it is not yet connected to. This removes all dangling nodes from the graph, as every vertex can reach every other vertex.

"Mining of Massive Datasets" [1] has a very well written explanation of this approach.

--sebastian

[1] http://infolab.stanford.edu/~ullman/mmds.html

On 17.05.2012 16:23, Paolo Castagna wrote:

Hi,

I noticed that the SimplePageRankVertex implementation does not deal with dangling nodes (dangling nodes are those nodes with no outgoing links). And, probably, that is one of the good reasons why there is a "Simple" in front. ;-)

"Dangling nodes exist in many forms. For example, a page of data, a page with a postscript graph, [...], a page that has been fetched by a crawler but not yet explored - these are all examples of possible dangling nodes. The more ambitious the crawl, the bigger the proportion of dangling nodes because the set of fetched but uncrawled pages grows quickly." (page 80) [1]

Wikipedia's PageRank page says:

"If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. If the random surfer arrives at a sink page, it picks another URL at random and continues surfing again. When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection. Their PageRank scores are therefore divided evenly among all other pages. In other words, to be fair with pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability usually set to d = 0.85, estimated from the frequency that an average surfer uses his or her browser's bookmark feature." [2]

Now, if I want to account for the dangling nodes, I need to sum all PageRank values for the dangling nodes and divide the total evenly among all other pages (which in practice means sending a message to all vertexes in Giraph... which is not possible, as it would be crazy with large graphs).

I imagine one can use a SumAggregator and alternate computing the contribution from dangling nodes and the PageRank values (i.e. PageRank values are computed only every 2 supersteps).

What do you think?

Paolo

[1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The Science of Search Engine Rankings" (2006)
[2] http://en.wikipedia.org/wiki/PageRank
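The alternating-superstep idea in Paolo's message can be simulated serially. The key constraint it models is that, in Giraph, a value aggregated at superstep t only becomes visible at superstep t+1, hence "PageRank values are computed only every 2 supersteps". The sketch below is our own illustration under that assumption (names are made up; a plain `double` stands in for a SumAggregator):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Serial simulation of the two-phase scheme: even supersteps only
// aggregate the rank mass on dangling vertices; odd supersteps apply
// the previously aggregated value in the normal PageRank update.
// Illustrative sketch, not the Giraph API.
class TwoPhasePageRank {
    static Map<Integer, Double> run(Map<Integer, List<Integer>> outLinks,
                                    int supersteps, double d) {
        int n = outLinks.size();
        Map<Integer, Double> ranks = new HashMap<>();
        for (Integer v : outLinks.keySet()) ranks.put(v, 1.0 / n);
        double danglingSum = 0.0; // stands in for a SumAggregator
        for (int step = 0; step < supersteps; step++) {
            if (step % 2 == 0) {
                // Phase 1: every dangling vertex contributes its rank.
                danglingSum = 0.0;
                for (Map.Entry<Integer, List<Integer>> e : outLinks.entrySet()) {
                    if (e.getValue().isEmpty()) danglingSum += ranks.get(e.getKey());
                }
            } else {
                // Phase 2: normal update plus the aggregated dangling share,
                // which is now "visible" one superstep later.
                Map<Integer, Double> next = new HashMap<>();
                double base = (1.0 - d) / n + d * danglingSum / n;
                for (Integer v : outLinks.keySet()) next.put(v, base);
                for (Map.Entry<Integer, List<Integer>> e : outLinks.entrySet()) {
                    if (e.getValue().isEmpty()) continue;
                    double share = d * ranks.get(e.getKey()) / e.getValue().size();
                    for (Integer dst : e.getValue()) next.merge(dst, share, Double::sum);
                }
                ranks = next;
            }
        }
        return ranks;
    }
}
```

Since the aggregation phase does not change any ranks, the value applied in phase 2 is consistent with the ranks being updated, and the total rank mass is preserved.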

Benjamin Heitmann wrote:
> Hello,
>
> since giraph requires the maven munge plugin in order to have correct source,
> and since google did not find any existing solution, I thought it might be okay to ask this here:
>
> How can I tell eclipse or the m2e maven eclipse plugin to execute the munge maven plugin, so that I have
> giraph sources which actually compile in eclipse without errors?
>
> Right now, I need to edit the java source files which are broken due to the munge syntax manually.
> However, I need to do this every time those files get changed.
>
> Is there an easier way to do this?

Hi Benjamin,
I do not have an answer/solution to your question. Just wanted to say I experience the same pain.

The goal of supporting all possible variants/versions of Hadoop is great... but it comes with a cost and pain.

Paolo

> cheers, Benjamin.

Hi, > It would be good to present users a couple of non trivial examples and one > or > two 'real' use cases where Apache Giraph is used for processing large > graphs. > Apache Giraph comes with two examples: all shortest paths from a single > source > and PageRank. Google's Pregel paper describes 'bipartite matching' and > 'semi-clustering'. Is anyone working on implementing these in Giraph? > Or, what if in the shortest paths example you actually want to know the > path? > > I have some toy code (not really well tested) that implements b-matching (that is matching with integer capacities on the nodes). It's a simple greedy method, along the lines of the one described here www.vldb.org/pvldb/vol4/p460-morales.pdf I can share it if you are interested. Cheers, -- Gianmarco It would be great to have examples on more advanced features: custom > partitioning functions, aggregators, ... > > Personally, I'd like to see a side-by-side comparison of Google's Pregel as > described in their paper and Giraph implementation (I am particularly > interested > on where they diverge and why). > > Another question (or thing I am not so sure about) is about 'capacity > planning' > (sort of...). Given a dataset and an algorithm implemented in Giraph, how > you > determine how many workers would be needed (in order to fit all your graph > and > messages for each superstep in RAM)? > > Last but not least, it seems to me that PageRank is what you use to > 'benchmark' > Giraph, is that the case? If that is the case, sharing a common dataset for > others to use would be a first initial step to allow people to compare > performances of different software running the very same algorithm, over > the > same data and the same hardware infrastructure. > > Paolo > > Sebastian Schelter wrote: > > Hi, > > > > I will give a talk titled "Large Scale Graph Processing with Apache > > Giraph" in Berlin on May 29th. 
Details are available at: > > > > > https://www.xing.com/events/gameduell-tech-talk-on-the-topic-large-scale-graph-processing-with-apache-giraph-1092275 > > > > Best, > > Sebastian > >
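The simple greedy method mentioned above can be sketched roughly as follows (a toy version of greedy b-matching, not Gianmarco's code and not Giraph API code; the capacity map b(v) and the tie-breaking are my own assumptions): scan the edges in decreasing weight order and keep an edge whenever both endpoints still have spare capacity.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GreedyBMatching {
    public static class Edge {
        public final int u, v;
        public final double w;
        public Edge(int u, int v, double w) { this.u = u; this.v = v; this.w = w; }
    }

    // Greedy b-matching: heaviest edges first, accept an edge only while
    // both endpoints have remaining capacity.
    public static List<Edge> match(List<Edge> edges, Map<Integer, Integer> capacity) {
        List<Edge> sorted = new ArrayList<Edge>(edges);
        sorted.sort(Comparator.comparingDouble((Edge e) -> -e.w)); // heaviest first
        Map<Integer, Integer> remaining = new HashMap<Integer, Integer>(capacity);
        List<Edge> matched = new ArrayList<Edge>();
        for (Edge e : sorted) {
            if (remaining.getOrDefault(e.u, 0) > 0 && remaining.getOrDefault(e.v, 0) > 0) {
                matched.add(e);
                remaining.put(e.u, remaining.get(e.u) - 1);
                remaining.put(e.v, remaining.get(e.v) - 1);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<Edge> edges = new ArrayList<Edge>();
        edges.add(new Edge(1, 2, 5.0));
        edges.add(new Edge(2, 3, 4.0));
        edges.add(new Edge(1, 3, 3.0));
        Map<Integer, Integer> b = new HashMap<Integer, Integer>();
        b.put(1, 1); b.put(2, 2); b.put(3, 1);
        // capacities admit (1,2) and (2,3); (1,3) is then blocked
        System.out.println(match(edges, b).size()); // prints 2
    }
}
```

The paper above does this in a distributed setting; the sequential version is only meant to show the accept/reject rule.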

Hi, by the way, about talks/presentations, here are the Apache Giraph talks/presentations I found:

“Giraph: Large-scale graph processing on Hadoop”, Avery Ching
Hadoop Summit 2011 - Santa Clara, California - June 2011
http://www.slideshare.net/averyching/20110628giraph-hadoop-summit
http://www.youtube.com/watch?v=l4nQjAG6fac

“Apache Giraph: Distributed Graph Processing in the Cloud”, Claudio Martella
FOSDEM 2012 - Brussels, Belgium - February 2012
http://prezi.com/9ake_klzwrga/apache-giraph-distributed-graph-processing-in-the-cloud/
http://blog.acaro.org/entry/giraph-talk-for-graphdevroom-fosdem-2012
http://www.youtube.com/watch?v=3ZrqPEIPRe4
http://www.youtube.com/watch?v=BmRaejKGeDM

“Introducing Apache Giraph for Large Scale Graph Processing”, Sebastian Schelter
Apache Hadoop Get Together - Berlin, Germany - April 2012
http://ssc.io/introducing-apache-giraph-for-large-scale-graph-processing/
http://www.slideshare.net/sscdotopen/introducing-apache-giraph-for-large-scale-graph-processing
http://vimeo.com/40737998

You could put the links on the Apache Giraph wiki. First of all, thank you for sharing them, and may I add a few comments or suggestions for future presentations? (don't take this as criticism, please)... It would be good to present users a couple of non-trivial examples and one or two 'real' use cases where Apache Giraph is used for processing large graphs. Apache Giraph comes with two examples: all shortest paths from a single source and PageRank. Google's Pregel paper describes 'bipartite matching' and 'semi-clustering'. Is anyone working on implementing these in Giraph? Or, what if in the shortest paths example you actually want to know the path? It would be great to have examples of more advanced features: custom partitioning functions, aggregators, ...
Personally, I'd like to see a side-by-side comparison of Google's Pregel as described in their paper and the Giraph implementation (I am particularly interested in where they diverge and why). Another question (or thing I am not so sure about) is about 'capacity planning' (sort of...). Given a dataset and an algorithm implemented in Giraph, how do you determine how many workers would be needed (in order to fit all your graph and messages for each superstep in RAM)? Last but not least, it seems to me that PageRank is what you use to 'benchmark' Giraph, is that the case? If that is the case, sharing a common dataset for others to use would be a first step to allow people to compare the performance of different software running the very same algorithm, over the same data and the same hardware infrastructure. Paolo Sebastian Schelter wrote: > Hi, > > I will give a talk titled "Large Scale Graph Processing with Apache > Giraph" in Berlin on May 29th. Details are available at: > > https://www.xing.com/events/gameduell-tech-talk-on-the-topic-large-scale-graph-processing-with-apache-giraph-1092275 > > Best, > Sebastian
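On the 'capacity planning' question above, a very rough back-of-envelope heuristic (entirely my own guesswork, nothing Giraph documents: all the per-object byte counts and the usable-heap fraction below are assumptions you would tune per algorithm) is that the graph plus one superstep's in-flight messages must fit in the aggregate worker heap:

```java
public class CapacitySketch {
    public static long workersNeeded(long vertices, long edges,
                                     long bytesPerVertex, long bytesPerEdge,
                                     long bytesPerMessage, long heapPerWorkerBytes,
                                     double usableHeapFraction) {
        // assume roughly one message in flight per edge per superstep
        long totalBytes = vertices * bytesPerVertex
                        + edges * (bytesPerEdge + bytesPerMessage);
        // leave headroom for GC and framework overhead
        return (long) Math.ceil(totalBytes / (heapPerWorkerBytes * usableHeapFraction));
    }

    public static void main(String[] args) {
        // e.g. 100M vertices, 1B edges, 4 GB heap per worker, half of it usable
        long workers = workersNeeded(100_000_000L, 1_000_000_000L,
                100, 32, 24, 4L * 1024 * 1024 * 1024, 0.5);
        System.out.println(workers); // 31 with these guessed constants
    }
}
```

In practice measuring the in-memory size of a representative partition is probably more reliable than any such formula.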

Warming up your audience :) On 12.05.2012 22:01, Jakob Homan wrote: > Stealing my thunder? :) > > On Sat, May 12, 2012 at 7:36 AM, Avery Ching <aching@apache.org> wrote: >> Nice! >> >> Avery >> >> >> On 5/12/12 2:58 AM, Sebastian Schelter wrote: >>> >>> Hi, >>> >>> I will give a talk titled "Large Scale Graph Processing with Apache >>> Giraph" in Berlin on May 29th. Details are available at: >>> >>> >>> https://www.xing.com/events/gameduell-tech-talk-on-the-topic-large-scale-graph-processing-with-apache-giraph-1092275 >>> >>> Best, >>> Sebastian >> >>

Stealing my thunder? :) On Sat, May 12, 2012 at 7:36 AM, Avery Ching <aching@apache.org> wrote: > Nice! > > Avery > > > On 5/12/12 2:58 AM, Sebastian Schelter wrote: >> >> Hi, >> >> I will give a talk titled "Large Scale Graph Processing with Apache >> Giraph" in Berlin on May 29th. Details are available at: >> >> >> https://www.xing.com/events/gameduell-tech-talk-on-the-topic-large-scale-graph-processing-with-apache-giraph-1092275 >> >> Best, >> Sebastian > >

Nice! Avery On 5/12/12 2:58 AM, Sebastian Schelter wrote: > Hi, > > I will give a talk titled "Large Scale Graph Processing with Apache > Giraph" in Berlin on May 29th. Details are available at: > > https://www.xing.com/events/gameduell-tech-talk-on-the-topic-large-scale-graph-processing-with-apache-giraph-1092275 > > Best, > Sebastian

Hi, I will give a talk titled "Large Scale Graph Processing with Apache Giraph" in Berlin on May 29th. Details are available at: https://www.xing.com/events/gameduell-tech-talk-on-the-topic-large-scale-graph-processing-with-apache-giraph-1092275 Best, Sebastian

Hello, since giraph requires the maven munge plugin in order to have correct source, and since Google did not find any existing solution, I thought it might be okay to ask this here: How can I tell eclipse or the m2e maven eclipse plugin to execute the munge maven plugin so that I have giraph sources which actually compile in eclipse without errors? Right now, I need to edit the java source files which are broken due to the munge syntax manually. However, I need to do this every time those files get changed. Is there an easier way to do this? cheers, Benjamin.

Hi Raimon, you actually have two graphs in your use case description :) The first one is a bipartite graph consisting of a set of search queries on the one hand and a set of web pages on the other hand. From this graph you want to create another graph where all pages that share a common query are connected. This is an algorithmic problem, not just a way of formatting the input. There are several ways to create this second graph. One way would be to use Mahout's ItemSimilarityJob with (query,page) tuples as input. ItemSimilarityJob will give you all or the top-k similar pages per page then (which will be (page,page) tuples). From this output you could create the second graph very easily. Alternatively you could think about implementing a pairwise similarity algorithm yourself in Giraph. You basically would need to find all pairs of vertices that share a common neighbor. --sebastian On 08.05.2012 19:17, Raimon Bosch wrote: > I'm designing a model to graph my web visits using the data in Access Log. > My idea is to create edges between my pages through queries coming from > Google i.e. if a user searches for "used cars in NY" and hits one of my > pages (say A), and one month later another user searches for "used cars in > NY" and hits another of my pages (say B) I can create an edge between A and > B where the value of the edge will be the number of pages viewed for those > 2 users. > > So my question is more directly related with the format used in Apache > Giraph: > > - Can I give values to the edges? i.e. A to B (cost is 6), and B to C (cost > is 4). > > - In the shortestPath example we have an input like this: > > [10,4500,[[11,1000]]] > [11,5500,[[12,1100]]] > [12,6600,[[13,1200]]] > [13,7800,[[14,1300]]] > [14,9100,[[0,1400]]] > > Can you give us an overview of what this graph would look like? That would > be a nice document for the wiki page. > > - How will the input look for my use case?
(A -> B (cost 6), B -> C (cost > 4)) > > > Thanks in advance, > Raimon Bosch. > > PS: Some feedback about my model would be appreciated too. I haven't found > any papers about this topic yet. >

Hi all, I'm designing a model to graph my web visits using the data in the Access Log. My idea is to create edges between my pages through queries coming from Google, i.e. if a user searches for "used cars in NY" and hits one of my pages (say A), and one month later another user searches for "used cars in NY" and hits another of my pages (say B), I can create an edge between A and B where the value of the edge will be the number of pages viewed for those 2 users. So my question is more directly related to the format used in Apache Giraph:

- Can I give values to the edges? i.e. A to B (cost is 6), and B to C (cost is 4).

- In the shortestPath example we have an input like this:

[10,4500,[[11,1000]]]
[11,5500,[[12,1100]]]
[12,6600,[[13,1200]]]
[13,7800,[[14,1300]]]
[14,9100,[[0,1400]]]

Can you give us an overview of what this graph would look like? That would be a nice document for the wiki page.

- How will the input look for my use case? (A -> B (cost 6), B -> C (cost 4))

Thanks in advance, Raimon Bosch.

PS: Some feedback about my model would be appreciated too. I haven't found any papers about this topic yet.
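Not an authoritative answer on the format question, but the shortestPath lines above read as [vertex_id, vertex_value, [[target_id, edge_value], ...]], which would mean edge values (the costs) are supported. A plain-Java illustration of that reading (not Giraph API code; the A=0, B=1, C=2 numbering is my own assumption):

```java
public class InputFormatSketch {
    // Split one input line, assumed to be
    // [vertex_id, vertex_value, [[target_id, edge_value], ...]],
    // into a human-readable description.
    public static String describe(String line) {
        String body = line.substring(1, line.length() - 1); // strip outer [ ]
        int edgesStart = body.indexOf('[');                 // edge list begins here
        String[] idAndValue = body.substring(0, edgesStart - 1).split(",");
        return "vertex " + idAndValue[0] + " (value " + idAndValue[1]
             + ") edges " + body.substring(edgesStart);
    }

    public static void main(String[] args) {
        // pages A=0, B=1, C=2 with A -> B cost 6 and B -> C cost 4
        System.out.println(describe("[0,0.0,[[1,6.0]]]"));
        System.out.println(describe("[1,0.0,[[2,4.0]]]"));
        System.out.println(describe("[2,0.0,[]]")); // C has no outgoing edges
    }
}
```

Under that reading, each line of the shortestPath example is one vertex (e.g. vertex 10 with value 4500) plus its outgoing edges (an edge to vertex 11 with value 1000).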

Yes I was, that's exactly the paper by Abadi and his student. On Sun, May 6, 2012 at 11:34 PM, Semih Salihoglu <semih@stanford.edu> wrote: > I think he's referring to this paper from > Yale: http://cs-www.cs.yale.edu/homes/dna/papers/sw-graph-scale.pdf > > > On Sun, May 6, 2012 at 2:09 PM, Zuhair Khayyat <zuhair.khayyat@kaust.edu.sa> > wrote: >> >> Hi, >> >> The presenter in the following video referenced some paper starting minute >> 16 talking about something on top of hadoop and included partitioning and >> data querying. Can any one help me and pinpoint the paper he was talking >> about? >> >> http://www.youtube.com/watch?v=BmRaejKGeDM >> >> Regards, >> Zuhair Khayyat > > -- Claudio Martella claudio.martella@gmail.com

I think he's referring to this paper from Yale: http://cs-www.cs.yale.edu/homes/dna/papers/sw-graph-scale.pdf On Sun, May 6, 2012 at 2:09 PM, Zuhair Khayyat <zuhair.khayyat@kaust.edu.sa>wrote: > Hi, > > The presenter in the following video referenced some paper starting minute > 16 talking about something on top of hadoop and included partitioning and > data querying. Can any one help me and pinpoint the paper he was talking > about? > > http://www.youtube.com/watch?v=BmRaejKGeDM > > Regards, > Zuhair Khayyat >

Hi, The presenter in the following video referenced some paper starting minute 16 talking about something on top of hadoop and included partitioning and data querying. Can any one help me and pinpoint the paper he was talking about? http://www.youtube.com/watch?v=BmRaejKGeDM Regards, Zuhair Khayyat

I think you're right that the javadoc isn't specific enough.

/**
 * Use a registered aggregator in current superstep.
 * Even when the same aggregator should be used in the next
 * superstep, useAggregator needs to be called at the beginning
 * of that superstep in preSuperstep().
 *
 * @param name Name of aggregator
 * @return boolean (false when not registered)
 */
boolean useAggregator(String name);

This should be augmented to say that none of the Aggregator methods should be called until this method is invoked. Feel free to file a JIRA and fix. Thanks! If you would like to, please feel free to add Aggregator documentation to https://cwiki.apache.org/confluence/display/GIRAPH/Index Avery On 5/2/12 12:15 PM, Benjamin Heitmann wrote: > Hello, > > I had to use aggregators for various statistic reporting tasks, > and I noticed that the aggregator operations need to be used in a very specific sequence, > especially when the aggregator is getting reset between supersteps. > > I found that the sequence described in RandomMessageBenchmark (in the org.apache.giraph.benchmark package) > results in consistent counts for one aggregator across all workers. > The most important thing seems to be to call the reset method setAggregatedValue() in preSuperstep() of the WorkerContext class, > before calling this.useAggregator(). > > If I called the reset method in postSuperstep(), then every worker reported a different value for the aggregator. > > However, the aggregator which gets reset between supersteps is still wrong. > > I know this because a second aggregator counts the same thing, and reports it after each superstep, > without getting reset. > > Is this a known issue? Should I file a bug report on it? > > > In addition, it would be great to document correct usage of the aggregators somewhere. > Even just in the javadoc of the aggregator interface might be enough. > > Should I try to add some documentation to the aggregator interface?
> (org.apache.giraph.graph.Aggregator.java)
> Then the committers can correct me if that documentation is wrong, I guess.

Hello, I had to use aggregators for various statistic reporting tasks, and I noticed that the aggregator operations need to be used in a very specific sequence, especially when the aggregator is getting reset between supersteps. I found that the sequence described in RandomMessageBenchmark (in the org.apache.giraph.benchmark package) results in consistent counts for one aggregator across all workers. The most important thing seems to be to call the reset method setAggregatedValue() in preSuperstep() of the WorkerContext class, before calling this.useAggregator(). If I called the reset method in postSuperstep(), then every worker reported a different value for the aggregator. However, the aggregator which gets reset between supersteps is still wrong. I know this because a second aggregator counts the same thing, and reports it after each superstep, without getting reset. Is this a known issue? Should I file a bug report on it? In addition, it would be great to document the correct usage of the aggregators somewhere. Even just in the javadoc of the aggregator interface might be enough. Should I try to add some documentation to the aggregator interface? (org.apache.giraph.graph.Aggregator.java) Then the committers can correct me if that documentation is wrong, I guess.
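The ordering described above can be illustrated with a toy, single-threaded simulation (plain Java, not Giraph API code; the superstep loop and the message values are made up): resetting at the beginning of each superstep, before any value is aggregated, keeps each superstep's count independent while a second, never-reset aggregator accumulates the running total.

```java
import java.util.ArrayList;
import java.util.List;

public class AggregatorResetDemo {
    // Each inner array plays the role of the values aggregated by the
    // vertices during one superstep's compute() calls.
    public static List<Long> perSuperstepSums(long[][] supersteps) {
        List<Long> sums = new ArrayList<Long>();
        long aggregated = 0;
        for (long[] values : supersteps) {
            aggregated = 0;           // reset in preSuperstep(), before any use
            for (long v : values) {
                aggregated += v;      // vertices aggregate during compute()
            }
            sums.add(aggregated);     // read back in postSuperstep()
        }
        return sums;
    }

    public static void main(String[] args) {
        long[][] messages = {{1, 2, 3}, {4, 5}};
        System.out.println(perSuperstepSums(messages)); // [6, 9]
    }
}
```

If the reset were moved to the end of the loop body (the postSuperstep() position), the first superstep would fold into whatever value was left over from before, which matches the inconsistent counts described above.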

Awesome! Congrats Eugene, we're excited to have you taking on a big role. Avery On 5/1/12 5:18 PM, Hyunsik Choi wrote: > Congrats and welcome Eugene! > I'm looking forward to your contribution. > > -- > Hyunsik Choi > > On Wed, May 2, 2012 at 5:39 AM, Jakob Homan <jghoman@gmail.com > <mailto:jghoman@gmail.com>> wrote: > > I'm happy to announce that the Giraph PMC has voted Eugene Koontz in > as a committer and PMC member. Eugene has been pitching in with great > patches that have been very useful, such as helping us sort out our > terrifying munging situation (GIRAPH-168). > > Welcome aboard, Eugene! > > -Jakob > >

Congrats and welcome Eugene! I'm looking forward to your contribution. -- Hyunsik Choi On Wed, May 2, 2012 at 5:39 AM, Jakob Homan <jghoman@gmail.com> wrote: > I'm happy to announce that the Giraph PMC has voted Eugene Koontz in > as a committer and PMC member. Eugene has been pitching in with great > patches that have been very useful, such as helping us sort out our > terrifying munging situation (GIRAPH-168). > > Welcome aboard, Eugene! > > -Jakob >

I'm happy to announce that the Giraph PMC has voted Eugene Koontz in as a committer and PMC member. Eugene has been pitching in with great patches that have been very useful, such as helping us sort out our terrifying munging situation (GIRAPH-168). Welcome aboard, Eugene! -Jakob

Hello Giraphers, I'd like to call your attention to BIGTOP-570 if you haven't seen it: https://issues.apache.org/jira/browse/BIGTOP-570 Please let me know if this isn't something you want. I should have asked first, my apologies for not thinking of that until now. Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

Hello Avery, On 1 May 2012, at 15:45, Avery Ching wrote: > I wonder if the issues you are seeing are related to https://issues.apache.org/jira/browse/GIRAPH-169. > > This shouldn't happen. Good to know that that should not happen. For my specific algorithm it happens all the time. For small amounts of processing the job finishes 2 minutes after the mappers report 100%. For larger amounts it can take 20 minutes or so. So there is definitely a connection between the expected length of processing the job, and the amount of time which passes after the mappers report 100%. I even had a pretty extreme case where most of the workers were restarted after an hour, and I killed the job after 90 minutes. In addition, the "100% map" always comes about 14-15 minutes after starting the job, independent of the total processing time. That might be due to the time it takes to read in the data, which is always around 11 minutes for the "vertex input superstep". (The data (and its size) which my job reads in order to construct the graph is always the same. Only the "configuration" of the algorithm changes. In my case, the configuration consists of the set of start nodes, and the association between different start nodes and user ids). Should I attach a zip file of the log directory for the job which restarted most of its workers after an hour? I can attach that to the JIRA issue.

Benjamin, I wonder if the issues you are seeing are related to https://issues.apache.org/jira/browse/GIRAPH-169. This shouldn't happen. Avery On 5/1/12 7:29 AM, Benjamin Heitmann wrote: > Hello, > > under which circumstances is it possible that a Giraph job will report that it is 100% finished with its mappers, > but the job is still running? > > I can see that it is still running from: > * debugging messages, > * list of running threads (in top) > * the hadoop jobtracker web site (it reports 100% completion of mappers, but also that all mappers are running, and none is complete) > > I am currently running some benchmarks to get a handle on the scalability of giraph and of my algorithm implementation. > And the results up to now are very confusing... > > > Looking forward to any answers, cheers, Benjamin.

Hello, under which circumstances is it possible that a Giraph job will report that it is 100% finished with its mappers, but the job is still running? I can see that it is still running from: * debugging messages, * list of running threads (in top) * the hadoop jobtracker web site (it reports 100% completion of mappers, but also that all mappers are running, and none is complete) I am currently running some benchmarks to get a handle on the scalability of giraph and of my algorithm implementation. And the results up to now are very confusing... Looking forward to any answers, cheers, Benjamin.

I'm excited too! Nice work Jakob. On 4/30/12 1:37 PM, Jakob Homan wrote: > I'm quite excited to be presenting a Giraph talk at Berlin Buzzwords > this year: http://berlinbuzzwords.de/sessions/large-scale-graph-computation-apache-giraph > > Come support your local BSP large-scale graph processing framework! > > Everyone that can make it to Berlin, please do. BB is without a doubt > the best conference I've been to and it's well worth the trip. Also, > don't forget Sebastian's workshop which will follow BB. > > -Jakob