giraph-dev mailing list archives

From Eli Reisman <initialcont...@gmail.com>
Subject Re: [jira] [Commented] (GIRAPH-249) Move part of the graph out-of-core when memory is low
Date Mon, 20 Aug 2012 17:50:22 GMT
Great discussion here, very interesting read. In some cases a simple command
line option (e.g. a spill-to-disk boolean) seems like the right fit, since the
vertex code requires no other changes or recompilation to run under that
strategy. For more dramatic changes, like using more compact, reusable data
structures within Giraph instead of creating and destroying all these
Writables all the time, the idea of several differing Vertex base classes,
each with documentation on best practices, would make life easy for users who
have a particular optimized approach to a problem and just want to get right
to implementing it without mastering all of Giraph's details.
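
To make the first idea concrete, here is a minimal sketch of that kind of
boolean switch, using plain Hadoop Configuration and a made-up option name
(the real name and wiring would come out of GIRAPH-249):

import org.apache.hadoop.conf.Configuration;

public class SpillOptionSketch {
  // Hypothetical key, not an existing Giraph constant.
  private static final String SPILL_TO_DISK = "giraph.spillPartitionsToDisk";

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // A user flips the switch from the command line
    // (-D giraph.spillPartitionsToDisk=true) or in code; the vertex
    // implementation itself never changes and is never recompiled.
    conf.setBoolean(SPILL_TO_DISK, true);
    boolean spill = conf.getBoolean(SPILL_TO_DISK, false);
    System.out.println("spill to disk enabled: " + spill);
  }
}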

I have felt for a long time that Writables belong on the input and output
fringes of Giraph, and that internally we are in great need of more compact
and reusable ways to work with the graph data. We do not run on outlandish
amounts of memory here (4GB per node is our ballpark) and try to scale to
lots of machines to spread the load around. Running this way, we process many
orders of magnitude more vertices and edges than the data loads being
discussed here, with success. Obviously there are use cases you all have
where spill to disk is an absolute necessity. But I do think you will be
surprised, if we implement a more compact way to deal with data between
nodes, at how far that will go when you do want to process in-memory, and
it's a great idea to pursue. The creator of Apache Crunch claims that by
choosing a standard data format for the internal processing steps of the MR
jobs Crunch composes and runs, regardless of the input and output formats the
final data will take, he has achieved great speedups over similar tools like
Pig. Of course he's using Avro, and we would need something closer to
Sebastian's ByteBuffer/byte array idea, but I think there's a lot of room to
explore this.
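
To illustrate the kind of compact, reusable representation I mean, here is a
rough sketch using only stock Hadoop classes (not Giraph code): many values
are packed into one growing byte buffer and read back through a single
reusable Writable instance, instead of allocating one object per value.

import java.io.IOException;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;

public class ByteBackedValuesSketch {
  public static void main(String[] args) throws IOException {
    DataOutputBuffer out = new DataOutputBuffer(); // one reusable buffer
    LongWritable scratch = new LongWritable();     // one reusable Writable

    // Serialize many values into the same byte buffer.
    for (long v = 0; v < 5; v++) {
      scratch.set(v);
      scratch.write(out);
    }

    // Read them back over the same bytes with the same scratch object.
    DataInputBuffer in = new DataInputBuffer();
    in.reset(out.getData(), out.getLength());
    for (int i = 0; i < 5; i++) {
      scratch.readFields(in);
      System.out.println(scratch.get());
    }
  }
}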

Does anyone see a compelling reason why input and output couldn't use
Writables and tangle with generics on the fringes of Giraph job operations,
while allowing us to remove them from the internals and use a standard data
format such as byte[] in the Giraph plumbing, and especially in data
transfers over the wire? It would be a big overhaul, but more and more it
seems like a logical step. Getting rid of the I, V, E, M in their current
Writables form, threaded all through the code, would simplify a lot of the
internals and provide users more freedom (again, see Crunch, which lets folks
write MR jobs using POJOs and a very simple coding paradigm).
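
As a hypothetical sketch of that split (the class and method names here are
mine, none of this is existing Giraph API): the plumbing would shuttle raw
bytes around, and the user's Writable type would only be materialized at the
fringes or on explicit request.

import java.io.IOException;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.Writable;

public class LazyWritableHolder<T extends Writable> {
  private byte[] bytes;
  private int length;

  /** Capture a value arriving from an input format as raw bytes. */
  public void store(T value) throws IOException {
    DataOutputBuffer out = new DataOutputBuffer();
    value.write(out);
    bytes = out.getData();
    length = out.getLength();
  }

  /** Decode into a caller-supplied, reusable instance only when needed. */
  public T load(T reuse) throws IOException {
    DataInputBuffer in = new DataInputBuffer();
    in.reset(bytes, length);
    reuse.readFields(in);
    return reuse;
  }
}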

Someone told me that the Green-Marl team at Stanford can run 10 billion
edges in 50 seconds on a single machine, and on no more memory than some
recent Giraph runs have been using per worker at some well-known businesses
(although more than I'm using).

It seems like we have to have some kind of reasonable in-memory story to go
with our other strategies to fully leverage Giraph's value proposition
(running on an existing Hadoop cluster, being part of the Hadoop ecosystem,
and so on).

Perhaps moving to a more practical internal data structure for graph data
is a reasonable approach that will help to serve both the spill-to-disk
advancements and the in-memory strategy for Giraph.

On Mon, Aug 20, 2012 at 9:09 AM, Gianmarco De Francisci Morales <
gdfm@gdfm.me> wrote:

> Hi Benjamin,
>
> Thanks for sharing your ideas.
> First, let me clarify that my proposal does not aim to exclude any of the
> current use cases from Giraph. I would just like to improve some common use
> cases that come up very often.
>
> Now, on the technical side of processing RDF.
> The type of a vertex/edge could be represented as an enum, which is just an
> integer number in the end.
> I would guess that having a central String<->ID map per worker is more
> memory-efficient than representing the vertex/edge attributes explicitly as
> Strings, for any reasonable distribution of the vertexes I can think of
> (i.e. if you have 1 vertex per machine then of course it is worse, but
> that's pathological anyway).
> If creating this kind of dictionary is easy (or even already done) for the
> user, and there is a well-defined and documented idiom for this kind of
> computation, I don't think users will have a problem with filling in some
> structure in the worker rather than implementing a vertex input format.
>
> Custom state in vertexes is for sure something I would like to keep in
> Giraph (at the expense of higher memory footprint).
> On the other hand, Giraph is about big data, so it is suboptimal that, to
> equal the performance I get on a single machine, I need to use 10 machines
> in Giraph just because I have to load the graph uncompressed.
> I would also like to see improvements for common simple cases that would
> make Giraph more useful in practice.
>
> The right API to allow all these cases should come out of this discussion.
>
> Cheers,
> --
> Gianmarco
>
>
>
> On Mon, Aug 20, 2012 at 5:44 PM, Benjamin Heitmann <
> benjamin.heitmann@deri.org> wrote:
>
> >
> > Hello, just a few in-line comments regarding the simplification of vertex
> > classes.
> >
> > In my opinion the proposed change might exclude all typed graphs, and all
> > Semantic Web style processing, from Giraph.
> >
> > On 17 Aug 2012, at 14:30, Gianmarco De Francisci Morales wrote:
> >
> > > In any case, if one wanted to use a compressed memory representation by
> > > aggregating different edge lists together, could one use the worker
> > > context
> > > as a central point of access to the compressed graphs?
> > > I can imagine a vertex class that has only the ID and uses the worker
> > > context to access its edge list (i.e. it is only a client to a central
> > > per-machine repository).
> > > Vertexes in the same partition would share this data structure.
> >
> > In the current vertex class signature, every user vertex can choose to
> > have a complex class to hold the state of the vertex.
> >
> > Will that capability be gone with this proposed simplification of a
> > vertex to only hold an id and a list of neighbour vertices?
> >
> > While most of the popular graph algorithms only take the graph itself
> > into account, there are types of algorithms which can also take the
> > semantics of the graph, of a node, and of an edge into account. Basically
> > everything from the area of Semantic Web graph analysis falls in this
> > area, and one specific type of algorithm is spreading activation.
> >
> > In a nutshell, spreading activation is a breadth-first search which is
> > guided by the semantics of the vertices and edges.
> >
> > An example: return all persons and posts which are somehow related to
> > this one person. In addition, exclude all vertices which are not persons
> > or posts, and give twice as much weight in the ranking to properties from
> > the music domain (all other properties have normal weight).
> >
> > If semantics cannot be stored as part of a vertex or an edge, then this
> > would require an external database lookup for each compute() call to a
> > vertex. That would basically eliminate all reasons to use Giraph for this
> > kind of algorithm.
> >
> > > Is there any obvious technical fallacy in this scheme?
> >
> > Not a technical fallacy, but I would argue that a lot will be lost by not
> > giving developers a mechanism for including custom state in their vertex.
> > Of course, developers need to be aware that this will increase the memory
> > footprint of their objects, and I guess serialising/deserialising of
> > strings will be a huge issue.
> >
> > But that should not be a reason to completely exclude such algorithms
> > from using Giraph, or to exclude any kind of typed graph or semantic
> > network from using Giraph.
>
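
For what it's worth, the per-worker String<->ID dictionary Gianmarco
describes above could be as simple as the following sketch (the class is
hypothetical, not part of Giraph): RDF terms get interned once per worker,
and vertices and edges carry only the int IDs.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WorkerDictionarySketch {
  private final Map<String, Integer> termToId = new HashMap<String, Integer>();
  private final List<String> idToTerm = new ArrayList<String>();

  /** Return the existing ID for a term, or assign the next free one. */
  public synchronized int intern(String term) {
    Integer id = termToId.get(term);
    if (id == null) {
      id = idToTerm.size();
      idToTerm.add(term);
      termToId.put(term, id);
    }
    return id;
  }

  /** Map an ID back to its term, e.g. when writing output. */
  public synchronized String lookup(int id) {
    return idToTerm.get(id);
  }
}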
