giraph-dev mailing list archives

From Avery Ching <ach...@apache.org>
Subject Re: [jira] [Created] (GIRAPH-249) Move part of the graph out-of-core when memory is low
Date Fri, 17 Aug 2012 18:20:31 GMT
Agreed, we should (and will) be moving further in this direction (byte
arrays / ByteBuffer) in the future for Giraph.

On 8/17/12 8:36 AM, Sebastian Schelter wrote:
> Stratosphere even employs its own memory management by serializing data
> to preallocated byte arrays. This not only allows for a very compact
> representation of the data, but also avoids major GC pauses and lets
> different buffer implementations gracefully spill to disk.
>
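As a minimal sketch of that approach in Java (the class and record layout below are illustrative assumptions, not Stratosphere or Giraph code), records can be serialized into one preallocated byte array and spilled wholesale when it fills up:

{code}
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

/** Hypothetical append-only buffer that stores (vertexId, value) pairs
 *  in a preallocated byte array and spills to disk when it fills up. */
public class SerializedPairBuffer {
  private static final int RECORD_SIZE = Long.BYTES + Double.BYTES;

  private final ByteBuffer buffer;   // wraps one preallocated byte[]
  private final String spillPath;

  public SerializedPairBuffer(int capacityBytes, String spillPath) {
    this.buffer = ByteBuffer.wrap(new byte[capacityBytes]);
    this.spillPath = spillPath;
  }

  /** Serialize one pair; spill the whole buffer first if it is full. */
  public void add(long vertexId, double value) throws IOException {
    if (buffer.remaining() < RECORD_SIZE) {
      spill();
    }
    buffer.putLong(vertexId);
    buffer.putDouble(value);
  }

  /** Write the raw bytes to disk and reset the buffer for reuse. */
  private void spill() throws IOException {
    try (FileOutputStream out = new FileOutputStream(spillPath, true)) {
      out.write(buffer.array(), 0, buffer.position());
    }
    buffer.clear();
  }

  /** Sum of all in-memory values, read back without creating objects. */
  public double sumInMemory() {
    double sum = 0;
    for (int offset = 0; offset < buffer.position(); offset += RECORD_SIZE) {
      sum += buffer.getDouble(offset + Long.BYTES);
    }
    return sum;
  }
}
{code}

Since the data lives in a single long-lived array, the garbage collector sees one object no matter how many logical records are stored.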
> On 17.08.2012 17:17, Claudio Martella wrote:
>> Yes, that is definitely the direction you may want to take at some
>> point. That is basically what Stanford's GPS does as well, and
>> Stratosphere too.
>>
>> On Friday, August 17, 2012, Alessandro Presta wrote:
>>
>>> I think at that point it would be worth having a new logical place for
>>> vertex/edge representation at worker- or partition-level.
>>> Avery had some ideas about this.
>>>
>>> Basically right now we're giving the user the freedom (and responsibility)
>>> to choose a representation (both in-memory and for serialization), but
>>> another way to go would be to take care of all that at the infrastructure
>>> level and expose only one Vertex class (where the user only defines the
>>> computation details and everything else is abstracted away). Then we could
>>> play around with compact representations and even more disruptive
>>> strategies (like streaming the graph/messages and re-using objects).
>>>
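One possible shape for such an API, sketched with hypothetical names (this is not the current Giraph Vertex hierarchy): the user only implements compute(), and a pluggable backing store decides how ids, values and edges are actually represented.

{code}
import java.util.Iterator;

/** Hypothetical user-facing vertex: only the computation is defined here;
 *  how ids, values and edges are stored is decided by the infrastructure. */
public abstract class ComputeOnlyVertex<I, V, E, M> {
  /** The one method a user implements. */
  public abstract void compute(Iterator<M> messages);

  // Accessors backed by whatever representation the framework chooses
  // (object graph, packed byte arrays, compressed edge lists, ...).
  public I getId() { return storage.getId(this); }
  public V getValue() { return storage.getValue(this); }
  public Iterable<E> getEdges() { return storage.getEdges(this); }

  /** Injected by the framework; could be swapped per job or per partition. */
  private VertexStorage<I, V, E> storage;

  void setStorage(VertexStorage<I, V, E> storage) { this.storage = storage; }

  /** Pluggable backing store interface. */
  public interface VertexStorage<I, V, E> {
    I getId(ComputeOnlyVertex<I, V, E, ?> vertex);
    V getValue(ComputeOnlyVertex<I, V, E, ?> vertex);
    Iterable<E> getEdges(ComputeOnlyVertex<I, V, E, ?> vertex);
  }
}
{code}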
>>> On 8/17/12 2:30 PM, "Gianmarco De Francisci Morales" <gdfm@apache.org>
>>> wrote:
>>>
>>>> I was under the impression that 100k was the upper limit to make things
>>>> work without crashing.
>>>>
>>>> In any case, if one wanted to use a compressed memory representation by
>>>> aggregating different edge lists together, could one use the worker
>>>> context as a central point of access to the compressed graphs?
>>>> I can imagine a vertex class that has only the ID and uses the worker
>>>> context to access its edge list (i.e. it is only a client to a central
>>>> per-machine repository).
>>>> Vertices in the same partition would share this data structure.
>>>>
>>>> Is there any obvious technical flaw in this scheme?
>>>>
>>>> Cheers,
>>>> --
>>>> Gianmarco
>>>>
>>>>
>>>>
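A minimal sketch of that scheme, using hypothetical names rather than Giraph API: vertices keep only their id and fetch their adjacency list from a per-worker store obtained through the worker context.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-worker repository holding all edge lists for the
 *  partitions assigned to this worker, possibly in compressed form. */
public class SharedEdgeStore {
  private final Map<Long, long[]> targetsByVertexId = new ConcurrentHashMap<>();

  public void putEdges(long vertexId, long[] targetIds) {
    targetsByVertexId.put(vertexId, targetIds);
  }

  public long[] getEdges(long vertexId) {
    return targetsByVertexId.get(vertexId);
  }
}

/** Hypothetical "client" vertex: it keeps only its id and asks the
 *  worker-level store for its edges when the computation needs them. */
class IdOnlyVertex {
  private final long id;
  private final SharedEdgeStore store;  // obtained via the worker context

  IdOnlyVertex(long id, SharedEdgeStore store) {
    this.id = id;
    this.store = store;
  }

  long[] getEdgeTargets() {
    return store.getEdges(id);
  }
}
{code}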
>>>> On Fri, Aug 17, 2012 at 3:18 PM, Alessandro Presta
>>>> <alessandro@fb.com> wrote:
>>>>
>>>>> The example where we actually ran out of memory was with 500K vertices
>>>>> and 500M edges, but yes, as a general rule we should strive to reduce
>>>>> our memory footprint in order to push the point where we need to go
>>>>> out of core as far away as possible.
>>>>>
>>>>> On 8/17/12 2:11 PM, "Gianmarco De Francisci Morales" <gdfm@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Very interesting.
>>>>>>
>>>>>> On a side note, a graph with 100k vertices and 100M edges is largish
>>>>>> but not that big after all.
>>>>>> If it does not fit in 10+ GB of memory, it means that each edge
>>>>>> occupies around 100 bytes (amortizing the cost of the vertex over the
>>>>>> edges).
>>>>>> In my opinion this deserves some thought.
>>>>>> If memory is an issue, why not think about compressed memory
>>>>>> structures, at least for common graph formats?
>>>>>>
>>>>>> Cheers,
>>>>>> --
>>>>>> Gianmarco
>>>>>>
>>>>>>
>>>>>>
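For scale, 10 GB spread over 100M edges is indeed roughly 100 bytes per edge. One compressed in-memory structure along these lines, sketched here with an assumed delta-plus-varint encoding (not something Giraph provides), stores a sorted neighbor list in a few bytes per edge:

{code}
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical compressed adjacency list: sorted neighbor ids are
 *  delta-encoded and written as variable-length integers. */
public class CompressedAdjacency {
  private final byte[] encoded;

  public CompressedAdjacency(long[] neighborIds) {
    long[] sorted = neighborIds.clone();
    Arrays.sort(sorted);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    long previous = 0;
    for (long id : sorted) {
      writeVarLong(out, id - previous);  // small deltas -> few bytes
      previous = id;
    }
    this.encoded = out.toByteArray();
  }

  public int sizeInBytes() {
    return encoded.length;
  }

  /** Decode back into a plain array (e.g. for iteration during compute). */
  public long[] decode() {
    long[] result = new long[8];
    int count = 0;
    long previous = 0;
    int pos = 0;
    while (pos < encoded.length) {
      long delta = 0;
      int shift = 0;
      byte b;
      do {
        b = encoded[pos++];
        delta |= (long) (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      previous += delta;
      if (count == result.length) {
        result = Arrays.copyOf(result, count * 2);
      }
      result[count++] = previous;
    }
    return Arrays.copyOf(result, count);
  }

  private static void writeVarLong(ByteArrayOutputStream out, long value) {
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80));
      value >>>= 7;
    }
    out.write((int) value);
  }
}
{code}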
>>>>>> On Wed, Aug 15, 2012 at 11:20 PM, Eli Reisman
>>>>>> <initialcontext@gmail.com> wrote:
>>>>>>
>>>>>>> Great metrics, this made a very interesting read, and great code too
>>>>>>> as always. This must have been a lot of work. I like the idea of
>>>>>>> eliminating the extra temporary storage data structures where
>>>>>>> possible, even when not going out-of-core. I think that, plus avoiding
>>>>>>> extra object creation during the workflow, can still do a lot for an
>>>>>>> in-core job's memory profile. This is looking really good, and with
>>>>>>> the config options it also sounds pluggable depending on your hardware
>>>>>>> situation, so it sounds great to me. Great work!
>>>>>>>
>>>>>>> On Wed, Aug 15, 2012 at 12:23 PM, Alessandro Presta (JIRA)
>>>>>>> <jira@apache.org> wrote:
>>>>>>>
>>>>>>>> [
>>>>>>>> https://issues.apache.org/jira/browse/GIRAPH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435437#comment-13435437
>>>>>>>> ]
>>>>>>>> Alessandro Presta commented on GIRAPH-249:
>>>>>>>> ------------------------------------------
>>>>>>>>
>>>>>>>> Thanks Claudio, good observation.
>>>>>>>> You got me curious so I quickly ran a shortest paths benchmark.
>>>>>>>>
>>>>>>>> 500k vertices, 100 edges/vertex, 10 workers
>>>>>>>>
>>>>>>>> This is with trunk:
>>>>>>>>
>>>>>>>> {code}
>>>>>>>> hadoop jar giraph-trunk.jar
>>>>>>>> org.apache.giraph.benchmark.ShortestPathsBenchmark
>>>>>>>> -Dgiraph.useN

