giraph-dev mailing list archives

From Gianmarco De Francisci Morales <g...@apache.org>
Subject Re: [jira] [Created] (GIRAPH-249) Move part of the graph out-of-core when memory is low
Date Sat, 18 Aug 2012 10:11:40 GMT
Hi,
interesting discussion.

On Fri, Aug 17, 2012 at 11:03 PM, Alessandro Presta <alessandro@fb.com> wrote:

> So, should we start brainstorming on possible approaches?
>
> I'll begin by listing some assumptions we may (or may not) consider
> reasonable (most of these have already come up at some point from various
> people):
>
> - Most algorithms need to iterate over all edges/messages, hence we could
> stop offering random access. This should allow for compact/serialized
> representations and even streaming data as we compute.
>

I guess we could provide different versions of the basic vertex, each with
different features.
For example, a RandomAccessVertex could offer fast random access to edges at
the expense of a larger memory footprint, while a SequentialAccessVertex could
save space by encoding the edges in some compact form (e.g. gap coding).
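To make the gap-coding idea concrete, here is a minimal sketch (not Giraph code; the class and method names are hypothetical): sorted neighbor ids are stored as varint-encoded gaps, which only supports sequential iteration but can be much smaller than an array of longs.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of gap coding for a SequentialAccessVertex:
// store sorted neighbor ids as varint-encoded deltas instead of full longs.
public class GapCodedEdges {
    // Encode sorted ids as gaps from the previous id, each gap as a varint.
    public static byte[] encode(long[] sortedIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long id : sortedIds) {
            long gap = id - prev;
            prev = id;
            // varint: 7 bits per byte, high bit set means "more bytes follow"
            while ((gap & ~0x7FL) != 0) {
                out.write((int) ((gap & 0x7F) | 0x80));
                gap >>>= 7;
            }
            out.write((int) gap);
        }
        return out.toByteArray();
    }

    // Sequential decode: the only access pattern this layout supports.
    public static List<Long> decode(byte[] data) {
        List<Long> ids = new ArrayList<>();
        long prev = 0;
        int pos = 0;
        while (pos < data.length) {
            long gap = 0;
            int shift = 0;
            byte b;
            do {
                b = data[pos++];
                gap |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += gap;
            ids.add(prev);
        }
        return ids;
    }
}
```

Since social-graph neighbor lists tend to have small gaps once sorted, most gaps fit in one or two bytes instead of eight.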


> - We need to reduce object instantiations. By keeping data serialized, we
> could re-use objects: think of a single vertex object that reads its
> serialized data and computes on the spot; edge/message iterators can be
> backed by a byte array/stream.
>

That's definitely a must. Object instantiation takes time, bloats memory and
forces the GC to do extra work.
By keeping everything in ByteBuffers and using primitives we could remove
object instantiation completely.
The only problem, alas, is that Java generics don't play well with
primitives.
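A sketch of what that could look like (again hypothetical, not Giraph API): messages stay serialized in one ByteBuffer, and a single reusable cursor decodes them in place, so iterating N messages allocates no per-message objects. It is specialized for double-valued messages precisely because Java generics cannot abstract over primitives without boxing.

```java
import java.nio.ByteBuffer;

// Hypothetical message store keeping all messages serialized in one buffer.
public class DoubleMessageStore {
    private final ByteBuffer buf;

    public DoubleMessageStore(int maxMessages) {
        buf = ByteBuffer.allocate(maxMessages * Double.BYTES);
    }

    public void add(double msg) {
        buf.putDouble(msg);
    }

    /** One cursor object per iteration; no object created per message. */
    public Cursor iterate() {
        return new Cursor(buf.duplicate());
    }

    public static final class Cursor {
        private final ByteBuffer view;
        private final int limit;

        Cursor(ByteBuffer view) {
            this.limit = view.position(); // everything written so far
            view.rewind();
            this.view = view;
        }

        public boolean hasNext() {
            return view.position() < limit;
        }

        public double next() {
            return view.getDouble(); // primitive read, no boxing
        }
    }
}
```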


> - It may be desirable to always store at least the vertex ids in memory. A
> partition can become a container of ids instead of full vertices. We may
> restrict the vertex id type to something like 64bit integers if that makes
> life easier.
>

I guess that would do in most cases.
Though sometimes it might be convenient to use strings as IDs, maybe even
just for testing.

On the other hand, if we have a good random-access structure that can also
compress the IDs, why not use it to implement a partition and keep
everything there?
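For illustration, the "partition as a container of 64-bit ids" idea could be sketched like this (all names hypothetical): the partition holds only a sorted long[] of vertex ids plus per-vertex offsets into some external edge store (compressed or out-of-core), so no Vertex objects are kept resident.

```java
import java.util.Arrays;

// Hypothetical id-only partition: vertex ids in memory, edges elsewhere.
public class IdOnlyPartition {
    private final long[] vertexIds;   // sorted 64-bit vertex ids
    private final long[] edgeOffsets; // offset of each vertex's edge list
                                      // in an external edge store

    public IdOnlyPartition(long[] sortedIds, long[] offsets) {
        vertexIds = sortedIds;
        edgeOffsets = offsets;
    }

    /** Returns the edge-store offset for a vertex id, or -1 if absent. */
    public long edgeOffset(long vertexId) {
        int i = Arrays.binarySearch(vertexIds, vertexId);
        return i >= 0 ? edgeOffsets[i] : -1;
    }

    public int size() {
        return vertexIds.length;
    }
}
```

At 16 bytes per vertex (id plus offset) this keeps random access by id via binary search, while the edge lists themselves can live in whatever compressed or spilled form the edge store chooses.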

Cheers,
--
Gianmarco


> On 8/17/12 7:20 PM, "Avery Ching" <aching@apache.org> wrote:
>
> >Agreed, we should (and will) be moving more in these directions (byte
> >arrays / ByteBuffer) in the future for Giraph.
> >
> >On 8/17/12 8:36 AM, Sebastian Schelter wrote:
> >> Stratosphere even employs its own memory management by serializing data
> >> to preallocated byte arrays. This not only allows for a very compact
> >> representation of the data, but also avoids major GC pauses and allows
> >> different buffer implementations to gracefully spill to disk.
> >>
> >> On 17.08.2012 17:17, Claudio Martella wrote:
> >>> Yes, that is definitely the direction you may want to take at a certain
> >>> point. That is basically what Stanford GPS does as well, and Stratosphere
> >>> too.
> >>>
> >>> On Friday, August 17, 2012, Alessandro Presta wrote:
> >>>
> >>>> I think at that point it would be worth having a new logical place for
> >>>> vertex/edge representation at worker- or partition-level.
> >>>> Avery had some ideas about this.
> >>>>
> >>>> Basically right now we're giving the user the freedom (and
> >>>> responsibility) to choose a representation (both in-memory and for
> >>>> serialization), but another way to go would be to take care of all that
> >>>> at the infrastructure level and expose only one Vertex class (where the
> >>>> user only defines the computation details and everything else is
> >>>> abstracted away). Then we could play around with compact representations
> >>>> and even more disruptive strategies (like streaming the graph/messages
> >>>> and re-using objects).
> >>>>
> >>>> On 8/17/12 2:30 PM, "Gianmarco De Francisci Morales" <gdfm@apache.org>
> >>>> wrote:
> >>>>
> >>>>> I was under the impression that 100k was the upper limit to make
> >>>>> things work without crashing.
> >>>>>
> >>>>> In any case, if one wanted to use a compressed memory representation
> >>>>> by aggregating different edge lists together, could one use the worker
> >>>>> context as a central point of access to the compressed graphs?
> >>>>> I can imagine a vertex class that has only the ID and uses the worker
> >>>>> context to access its edge list (i.e. it is only a client to a central
> >>>>> per-machine repository).
> >>>>> Vertexes in the same partition would share this data structure.
> >>>>>
> >>>>> Is there any obvious technical fallacy in this scheme?
> >>>>>
> >>>>> Cheers,
> >>>>> --
> >>>>> Gianmarco
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Aug 17, 2012 at 3:18 PM, Alessandro Presta <alessandro@fb.com>
> >>>>> wrote:
> >>>>>
> >>>>>> The example where we actually go out of memory was with 500K
> >>>>>> vertices and 500M edges, but yes, as a general rule we should
> >>>>>> strive to reduce our memory footprint in order to push the point
> >>>>>> where we need to go out of core as far away as possible.
> >>>>>>
> >>>>>> On 8/17/12 2:11 PM, "Gianmarco De Francisci Morales" <gdfm@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Very interesting.
> >>>>>>>
> >>>>>>> On a side note, a graph with 100k vertices and 100M edges is largish
> >>>>>>> but not that big after all.
> >>>>>>> If it does not fit on 10+ GB of memory, it means that each edge
> >>>>>>> occupies around 100B (amortizing the cost of the vertex over the
> >>>>>>> edges).
> >>>>>>> In my opinion this deserves some thought.
> >>>>>>> If memory is an issue, why not think about compressed memory
> >>>>>>> structures, at least for common graph formats?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> --
> >>>>>>> Gianmarco
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Aug 15, 2012 at 11:20 PM, Eli Reisman
> >>>>>>> <initialcontext@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Great metrics, this made a very interesting read, and great code
> >>>>>>>> too as always. This must have been a lot of work. I like the idea
> >>>>>>>> of eliminating the extra temporary storage data structures where
> >>>>>>>> possible, even when not going out-of-core. I think that + avoiding
> >>>>>>>> extra object creation during the workflow can still do a lot for an
> >>>>>>>> in-core job's memory profile, but this is looking really good, and
> >>>>>>>> it sounds like with the config options it's also pluggable
> >>>>>>>> depending on your hardware situation, so it sounds great to me.
> >>>>>>>> Great work!
> >>>>>>>>
> >>>>>>>> On Wed, Aug 15, 2012 at 12:23 PM, Alessandro Presta (JIRA)
> >>>>>>>> <jira@apache.org> wrote:
> >>>>>>>>
> >>>>>>>>>      [
> >>>>>>>>>
> >>>>
> >>>>
> https://issues.apache.org/jira/browse/GIRAPH-249?page=com.atlassian.jir
> >>>>>>>> a
> >>>>>> .
> >>>>>>
> >>>>>>>>
> >>>>>>>>plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1343
> >>>>>>>>5437
> >>>>>>>> #c
> >>>>>>>> omment-13435437
> >>>>>>>> ]
> >>>>>>>>> Alessandro Presta commented on GIRAPH-249:
> >>>>>>>>> ------------------------------------------
> >>>>>>>>>
> >>>>>>>>> Thanks Claudio, good observation.
> >>>>>>>>> You got me curious so I quickly ran a shortest paths benchmark.
> >>>>>>>>>
> >>>>>>>>> 500k vertices, 100 edges/vertex, 10 workers
> >>>>>>>>>
> >>>>>>>>> This is with trunk:
> >>>>>>>>>
> >>>>>>>>> {code}
> >>>>>>>>> hadoop jar giraph-trunk.jar
> >>>>>>>>> org.apache.giraph.benchmark.ShortestPathsBenchmark
> >>>>>>>> -Dgiraph.useN
> >>>
> >>>
> >
>
>
