giraph-dev mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: [jira] [Created] (GIRAPH-249) Move part of the graph out-of-core when memory is low
Date Fri, 17 Aug 2012 15:36:30 GMT
Stratosphere even employs its own memory management by serializing data
to preallocated byte arrays. This not only allows for a very compact
representation of the data, but also avoids major GC pauses and allows
different buffer implementations to gracefully spill to disk.
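
For illustration only, here is a minimal sketch of that idea (the class name and layout are invented, not Stratosphere's or Giraph's actual code): records are appended into one preallocated buffer, so the heap holds a single large array instead of millions of small objects, and the filled buffer can be spilled to disk in one sequential write.

{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: serialize edge records into one preallocated buffer
// instead of keeping them as individual objects on the heap.
public class SerializedEdgeBuffer {
    private final ByteBuffer buffer;

    public SerializedEdgeBuffer(int capacityBytes) {
        // Allocated once; the GC only ever sees this single object.
        this.buffer = ByteBuffer.allocate(capacityBytes);
    }

    /** Returns false if the record does not fit, so the caller can spill first. */
    public boolean append(long targetVertexId, float edgeValue) {
        if (buffer.remaining() < 12) {  // 8 bytes id + 4 bytes value
            return false;
        }
        buffer.putLong(targetVertexId).putFloat(edgeValue);
        return true;
    }

    /** Spill the filled part of the buffer to disk in one sequential write. */
    public void spillTo(Path file) throws IOException {
        buffer.flip();
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            channel.write(buffer);
        }
        buffer.clear();
    }
}
{code}

Since the data already sits in serialized form, spilling is just a sequential write of raw bytes, and there are no per-record objects for the garbage collector to trace.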

On 17.08.2012 17:17, Claudio Martella wrote:
> Yes, that is definitely the direction you may want to take at a certain
> point. That is basically what Stanford GPS does as well, and Stratosphere
> too.
> 
> On Friday, August 17, 2012, Alessandro Presta wrote:
> 
>> I think at that point it would be worth having a new logical place for
>> vertex/edge representation at worker- or partition-level.
>> Avery had some ideas about this.
>>
>> Basically right now we're giving the user the freedom (and responsibility)
>> to choose a representation (both in-memory and for serialization), but
>> another way to go would be to take care of all that at infrastructure
>> level and expose only one Vertex class (where the user only defines the
>> computation details and everything else is abstracted away). Then we could
>> play around with compact representations and even more disruptive
>> strategies (like streaming the graph/messages and re-using objects).
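
To make that concrete, here is a rough, hypothetical sketch of what such a single vertex class could look like (the names below are invented; this is not Giraph's actual API): the user implements only compute(), while edge storage and iteration belong to the framework and can be swapped for compact or out-of-core backends without touching user code.

{code}
import java.io.IOException;

// Hypothetical, representation-agnostic vertex API. None of these types
// exist under these names; edge storage is assumed to live in a
// framework-owned, per-partition structure.
public abstract class ComputationVertex<I, V, E, M> {

    /** User code: the only method an application has to implement. */
    public abstract void compute(Iterable<M> messages) throws IOException;

    /** Provided by the framework, backed by whatever representation it chooses. */
    public abstract I getId();
    public abstract V getValue();
    public abstract void setValue(V value);

    /**
     * Edges are exposed only as an iterable, so the backing store can be a
     * compact serialized buffer, a shared per-partition structure, or an
     * out-of-core container without changing user code.
     */
    public abstract Iterable<Edge<I, E>> getEdges();

    public abstract void sendMessage(I targetId, M message);
    public abstract void voteToHalt();

    /** Minimal edge view; the framework may reuse a single instance while iterating. */
    public interface Edge<I, E> {
        I getTargetVertexId();
        E getValue();
    }
}
{code}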
>>
>> On 8/17/12 2:30 PM, "Gianmarco De Francisci Morales" <gdfm@apache.org>
>> wrote:
>>
>>> I was under the impression that 100k was the upper limit to make things
>>> work without crashing.
>>>
>>> In any case, if one wanted to use a compressed memory representation by
>>> aggregating different edge lists together, could one use the worker
>>> context as a central point of access to the compressed graphs?
>>> I can imagine a vertex class that has only the ID and uses the worker
>>> context to access its edge list (i.e. it is only a client to a central
>>> per-machine repository).
>>> Vertices in the same partition would share this data structure.
>>>
>>> Is there any obvious technical flaw in this scheme?
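
As a thought experiment, the scheme described above might look roughly like this (class names are invented; this is not existing Giraph code): the worker context would own one per-partition adjacency store, and each vertex object would keep only its id and ask that store for its edges.

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-partition repository of adjacency lists, shared by all
// vertices of the partition and reachable through the worker context.
public class PartitionEdgeRepository {
    // Adjacency lists kept as primitive arrays, keyed by vertex id. A real
    // compressed layout (one flat array with offsets, delta/varint encoding,
    // ...) could replace this map without changing the client side.
    private final Map<Long, long[]> adjacency = new HashMap<Long, long[]>();

    public void putEdges(long vertexId, long[] targetIds) {
        adjacency.put(vertexId, targetIds);
    }

    public long[] getEdges(long vertexId) {
        long[] targets = adjacency.get(vertexId);
        return targets != null ? targets : new long[0];
    }
}

// The vertex is only a client: it keeps its id and delegates edge access.
class LightweightVertex {
    private final long id;
    private final PartitionEdgeRepository repository;

    LightweightVertex(long id, PartitionEdgeRepository repository) {
        this.id = id;
        this.repository = repository;
    }

    long[] edges() {
        return repository.getEdges(id);
    }
}
{code}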
>>>
>>> Cheers,
>>> --
>>> Gianmarco
>>>
>>>
>>>
>>> On Fri, Aug 17, 2012 at 3:18 PM, Alessandro Presta
>>> <alessandro@fb.com> wrote:
>>>
>>>> The example where we actually go out of memory was with 500K vertices
>>>> and 500M edges, but yes, as a general rule we should strive to reduce
>>>> our memory footprint in order to push the point where we need to go
>>>> out of core as far away as possible.
>>>>
>>>> On 8/17/12 2:11 PM, "Gianmarco De Francisci Morales" <gdfm@apache.org>
>>>> wrote:
>>>>
>>>>> Very interesting.
>>>>>
>>>>> On a side note, a graph with 100k vertices and 100M edges is largish
>>>>> but not that big after all.
>>>>> If it does not fit in 10+ GB of memory, it means that each edge
>>>>> occupies around 100B (amortizing the cost of the vertex over the edges).
>>>>> In my opinion this deserves some thought.
>>>>> If memory is an issue, why not think about compressed memory
>>>>> structures, at least for common graph formats?
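
To illustrate what a compressed in-memory structure could buy, here is a small sketch (invented code, not part of Giraph) of a standard trick for adjacency lists: sort the neighbor ids, store the gaps between consecutive ids, and encode each gap as a variable-length integer, which typically brings an edge down from 8 bytes plus object overhead to a few bytes.

{code}
import java.io.ByteArrayOutputStream;

// Hypothetical delta + varint encoding of a sorted adjacency list.
public class DeltaVarintEdgeList {

    /** Encode a sorted array of neighbor ids as delta-encoded varints. */
    public static byte[] encode(long[] sortedTargetIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long id : sortedTargetIds) {
            long gap = id - previous;      // gaps are small when ids are clustered
            previous = id;
            while ((gap & ~0x7FL) != 0) {  // emit 7 bits at a time, high bit = "more"
                out.write((int) ((gap & 0x7F) | 0x80));
                gap >>>= 7;
            }
            out.write((int) gap);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long[] neighbors = {1000, 1002, 1005, 1050, 1051};
        byte[] compressed = encode(neighbors);
        // 5 edges in a handful of bytes instead of 5 * 8 bytes (plus objects).
        System.out.println(compressed.length + " bytes for " + neighbors.length + " edges");
    }
}
{code}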
>>>>>
>>>>> Cheers,
>>>>> --
>>>>> Gianmarco
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 15, 2012 at 11:20 PM, Eli Reisman
>>>>> <initialcontext@gmail.com> wrote:
>>>>>
>>>>>> Great metrics, this made a very interesting read, and great code too
>>>>>> as always. This must have been a lot of work. I like the idea of
>>>>>> eliminating the extra temporary storage data structures where
>>>>>> possible, even when not going out-of-core. I think that, plus avoiding
>>>>>> extra object creation during the workflow, can still do a lot for an
>>>>>> in-core job's memory profile. But this is looking really good, and it
>>>>>> sounds like with the config options it's also pluggable depending on
>>>>>> your hardware situation, so it sounds great to me.
>>>>>> Great work!
>>>>>>
>>>>>> On Wed, Aug 15, 2012 at 12:23 PM, Alessandro Presta (JIRA)
>>>>>> <jira@apache.org> wrote:
>>>>>>
>>>>>>>
>>>>>>>     [ https://issues.apache.org/jira/browse/GIRAPH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435437#comment-13435437 ]
>>>>>>>
>>>>>>> Alessandro Presta commented on GIRAPH-249:
>>>>>>> ------------------------------------------
>>>>>>>
>>>>>>> Thanks Claudio, good observation.
>>>>>>> You got me curious so I quickly ran a shortest paths benchmark.
>>>>>>>
>>>>>>> 500k vertices, 100 edges/vertex, 10 workers
>>>>>>>
>>>>>>> This is with trunk:
>>>>>>>
>>>>>>> {code}
>>>>>>> hadoop jar giraph-trunk.jar
>>>>>>> org.apache.giraph.benchmark.ShortestPathsBenchmark
>>>>>>> -Dgiraph.useN