giraph-user mailing list archives

From Paolo Castagna <castagna.li...@gmail.com>
Subject Re: Can Giraph handle graphs with very large number of edges per vertex?
Date Sun, 16 Sep 2012 09:24:28 GMT
Hi Eli,
mine was just an attempt made before GIRAPH-249, and it was not
successful because I had problems subclassing MutableVertex at the
time, if I remember correctly (I think I shared some of those issues
on the list back then). Anyway, now that GIRAPH-249 is closed, the
need for that is reduced or gone. I just need some time to look at the
Giraph source code and test/use what you did for GIRAPH-249. Giraph is
making such good progress and I still need to catch up. :-)

If/when I do that and I see any valuable (i.e. faster) alternative,
I'll share it here or implement an example and share the code on
GitHub.

Cheers,
Paolo

On 14 September 2012 11:03, Eli Reisman <apache.mailbox@gmail.com> wrote:
> Great discussion. I am very curious about the Apache Jena spill solution you
> were speaking of, will check it out. What was your impression, was it
> successful for your uses? Sounds like it was not so hard to adapt for this
> use?
>
> The good news is, Giraph recently acquired the ability to optionally spill
> both messages and vertex data to disk to avoid overloads, and when
> configured right should provide the functionality you're looking for. Even
> though Giraph rides atop the Hadoop framework, it performs its calculations
> in a fundamentally different paradigm than MapReduce, so I doubt we will
> ever fully replicate the ability of Hadoop to trade cluster size for
> calculation time so transparently. Regarding the use of these new features,
> there are threads on the Giraph JIRA list by Maja, Claudio, and Alessandro
> regarding these issues I'd recommend reading. Try it out, and please let us
> know how it goes for you. It's exciting for us to have these features
> available now.
>>
>> Many thanks for the point-by-point replies. It clarifies a lot of
>> questions I had.
>>
>> The Pregel papers did throw more light on the approach and architecture.
>>
>>
>>
>> Hi Eli,
>>
>> Your feedback about very large scale applications on Giraph sounds very
>> encouraging. Thanks very much.
>>
>>
>>
>> After reading both of your replies, I have some (final!) questions
>> regarding memory usage:
>>
>> ·         For applications with a large number of edges per vertex: Are
>> there any built-in vertex or helper classes or at least sample code which
>> feature spilling of edges to disk, or some kind of disk-backed map of edges,
>> to support such vertices? Or do we have to sort of roll our own?
>>
>> ·         For graphs with a large number of vertices relative to available
>> workers, at least in development phase,  one may not always have access to a
>> large number of workers, yet one might wish to process a very large graph.
>> In these cases, it may happen that the workers may not be able to hold all
>> their assigned vertices in memory. So again in this case, are there any
>> built-in classes to allow spilling of vertices to disk, or a similar kind of
>> disk-backed map?
>>
>> ·         Assuming some kind of disk backing is implemented to handle
>> large number of vertices/edges (under a situation of insufficient # of
>> workers or memory per worker), is it likely that just the volume of IO
>> (message/IPC) could cause OOMEs? Or merely slowdowns?
>>
>>
>>
>> In general, I feel that one of the reasons for wide and rapid adoption of
>> Hadoop is the “download, install and run” feature, where even for large data
>> sets, the stock code will still run to completion on a single laptop (or a
>> single Linux server, etc), except that it will take more time. But this may
>> be perfectly acceptable for people who are evaluating and experimenting,
>> since there is no incurred cost for hardware. A lot of developers might be
>> OK with giving the thing a run overnight on their laptops or fire up just
>> one spot instance on EC2 etc and let it chug along for a couple of days.
>>
>> I know this was the case for me when I was starting out with Hadoop. So
>> more nodes are needed only to speed things up, but not for functionality.
>>
>> It might be great to include such features in Giraph as well, which would
>> require that disk-backed workers be supported in the code as a standard
>> feature.
>>
>>
>>
>> Would love to hear your thoughts on these…
>>
>>
>>
>> Thanks,
>>
>> Jeyendran
>>
>>
>>
>>
>>
>> From: Eli Reisman [mailto:apache.mailbox@gmail.com]
>> Sent: Tuesday, September 11, 2012 12:11 PM
>> To: user@giraph.apache.org
>> Subject: Re: Can Giraph handle graphs with very large number of edges per
>> vertex?
>>
>>
>>
>> Hi Jeyendran, I was just saying the same thing about the documentation on
>> another thread, couldn't agree more. There will be progress on this soon, I
>> promise. I'd like us to reach a model of "if you add a new feature or change
>> a core feature, the patch gets committed contingent on a new wiki page of
>> docs going up on the website." There's still nothing about our new Vertex
>> API, master compute, etc. on the wiki.
>>
>> I would say 8 gigs to play with is a great amount where you will most
>> definitely be able to get very large interesting graphs to run in-memory,
>> depending on how many workers (with 8G each) you have to work with. Having
>> 3-4 workers per machine is not a bad thing if you are provisioned to do
>> this. And lots of machines. This is a distributed batch processing
>> framework, so more is better ;)
>>
>> As far as vertices with a million edges go, sure, but it depends on how many
>> of them and your compute resources. Again, can't go into much detail but
>> Giraph has been extensively tested using real-world, large, interesting,
>> useful graph data. This includes large social graphs that have supernodes.
>> So if you're supplying that, and you have the gear to run your data, you've
>> picked the right tool. You can spill to disk, run in memory, or spread the
>> load and scale to many, many workers (Mapper tasks) hosted on many nodes and
>> Giraph will behave well if you have the compute resource to scale to fit
>> your volume of data.
>>
>> On Tue, Sep 11, 2012 at 12:27 AM, Avery Ching <aching@apache.org> wrote:
>>
>> Hi Jeyendran, nice to meet you.
>>
>> Answers inline.
>>
>>
>>
>> On 9/10/12 11:23 PM, Jeyendran Balakrishnan wrote:
>>
>> I am trying to understand what kind of data Giraph holds in memory per
>> worker.
>> My questions in descending order of importance:
>> 1. Does Giraph hold in memory exactly one vertex of data at a time, or
>> does it need to hold all the vertices assigned to that worker?
>>
>> All vertices assigned to that worker.
>>
>>
>>
>> 2. Can Giraph handle vertices with, say, a million edges per vertex?
>>
>> Depends on how much memory you have.  Would recommend making a custom
>> vertex implementation that has a very efficient store for better scalability
>> (i.e. see IntIntNullIntVertex).
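
The custom vertex with a very efficient store suggested above can be illustrated with a minimal sketch. It assumes int vertex ids and no edge values, and keeps the out-edge targets in one growable int[] so that a million edges cost about 4 MB of payload (1,000,000 x 4 bytes); a collection of boxed objects adds per-object overhead on top of that. IntEdgeStore and its methods are hypothetical names for illustration, not Giraph classes, and this is not necessarily how IntIntNullIntVertex is implemented.

```java
import java.util.Arrays;

/** Hypothetical compact out-edge store (illustration only, not Giraph API). */
final class IntEdgeStore {
    private int[] targets = new int[8]; // target vertex ids, packed
    private int size = 0;               // number of live entries

    /** Appends one out-edge, doubling the backing array when it is full. */
    void add(int targetVertexId) {
        if (size == targets.length) {
            targets = Arrays.copyOf(targets, Math.max(8, size * 2));
        }
        targets[size++] = targetVertexId;
    }

    int size() {
        return size;
    }

    int get(int i) {
        if (i < 0 || i >= size) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        return targets[i];
    }

    /** Bytes used by the live edge ids only (4 bytes per int). */
    long payloadBytes() {
        return 4L * size;
    }

    /** Drops unused capacity once the edge list is complete. */
    void trimToSize() {
        targets = Arrays.copyOf(targets, size);
    }

    public static void main(String[] args) {
        IntEdgeStore edges = new IntEdgeStore();
        for (int i = 0; i < 1_000_000; i++) {
            edges.add(i);
        }
        edges.trimToSize();
        System.out.println(edges.size() + " edges, payload bytes = " + edges.payloadBytes());
    }
}
```

Whether this fits in a given heap still depends on how many such vertices a worker holds, as the answer above says.
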
>>
>>
>>
>>     If not, at what order of magnitude does it break down? - 1000 edges,
>> 10K edges, 100K edges?...
>>    (Of course, I understand that this depends upon the -Xmx value, so
>> let's say we fix a value of -Xmx8g).
>> 3. Are there any limitations on the kind of objects that can be used as
>> vertices?
>>     Specifically, does Giraph assume that vertices are lightweight (eg,
>> integer vertex ID + simple Java primitive vertex values + collection of
>> out-edges),
>>     or can Giraph support heavyweight vertices (hold complex nested Java
>> objects in a vertex)?
>>
>> Limitations are that the vertex implementation must be Writable, the
>> vertex index must be WritableComparable, edge type Writable, message type
>> Writable.
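
To illustrate the Writable requirement for a heavier vertex value, here is a minimal sketch of the serialization pattern. Hadoop's Writable interface declares write(DataOutput) and readFields(DataInput); to stay self-contained, the hypothetical ProfileValue class below defines those two methods without the Hadoop dependency, and in a real job it would declare `implements org.apache.hadoop.io.Writable`. The field names and the roundTrip helper are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical nested vertex value following the Writable method contract. */
final class ProfileValue {
    String name = "";
    final List<String> tags = new ArrayList<>();

    /** Serializes every field, in a fixed order. */
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(tags.size());
        for (String tag : tags) {
            out.writeUTF(tag);
        }
    }

    /** Restores the fields in the same order; clears old state first. */
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        tags.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            tags.add(in.readUTF());
        }
    }

    /** Writes the value to a byte buffer and reads it back into a new object. */
    static ProfileValue roundTrip(ProfileValue original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            ProfileValue copy = new ProfileValue();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ProfileValue value = new ProfileValue();
        value.name = "alice";
        value.tags.add("supernode");
        ProfileValue copy = roundTrip(value);
        System.out.println(copy.name + " " + copy.tags);
    }
}
```

The point of the sketch is that a value can be as nested as needed, provided every field is written and read back in the same order.
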
>>
>>
>>
>> 4. More generally, what data is stored in memory, and what, if any, is
>> offloaded/spilled to disk?
>>
>> Messages and vertices can be spilled to disk, but you must enable this.
>>
>>
>>
>> Would appreciate any light the experts can throw on this.
>>
>> On this note, I would like to mention that the presentations posted on the
>> Wiki explain what Giraph can do, and how to use it from a coding
>> perspective, but there are no explanations of the design approach used,
>> the rationale behind the choices, and the software architecture. I feel
>> that new users can really benefit from a design and architecture document,
>> along the lines of Hadoop and Lucene. For folks who are considering whether
>> or not to use Giraph, this can be a big help. The only alternative today is
>> to read the source code, the burden of which might in itself be reason for
>> folks not to consider using Giraph.
>> My 2c  :-)
>>
>>
>>
>> Agreed that documentation is lacking =).  That being said, the
>> presentations explain most of the design approach and reasons.  I would
>> refer to the Pregel paper for a more detailed look or ask if you have any
>> specific questions.
>>
>>
>> Thanks a lot,
>>
>> No problem!
>>
>> Jeyendran
>>
>>
>>
>>
>
>
