incubator-giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jake Mannix (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-28) Introduce new primitive-specific MutableVertex subclasses
Date Mon, 12 Sep 2011 05:12:08 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-28?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102428#comment-13102428
] 

Jake Mannix commented on GIRAPH-28:
-----------------------------------

Yeah, I don't know what's eating up the space in my Primitive class, but I *bet* it has something
to do with my "wrapper" SortedMap I return when you call getOutEdgeMap().  This object *should*
be lazily instantiated, but as it has a "get***" name to it, I wonder if the ObjectSizeCalculator
is considering the returned value to be part of the object's memory?  If so, then it *should*
be what we see.  Dmitriy, can you try re-running your *original* test, on the real LongDoubleFloatDoubleVertex,
but with getOutEdgeMap() implemented as just "return null;", and similarly with getMsgList()?
 If you test doesn't use those methods.  My guess is that the measured size will drop down
to close to what your results for "Primitive Map" are.

Regarding long[] and float[], it's actually not a totally crazy implementation, for the "final
state" of the in-memory representation: most graph algorithms leave the graph structure alone
(ie don't add or delete edges during iteration), in which case compacting the Vertex down
to a pair of parallel sorted arrays after loading from the VertexReader is not such a unreasonable
situation.  Of course, it then requires that random-access reads to the outbound edges do
a log(numOutEdges) binary search, but not all graph algorithms require lots of random access
to specific edges (as opposed to bulk access to all edges).  

I could imagine two separate primitive implementations - one based on a map-like structure
(a la OpenXxxYyyHashMap), which gives random access to the edges (but then *not* sorted access!
 Note that the current code in my primitive vertex throws UnsupportedOperationException for
lots of SortedMap methods, which happen to be non exercised in the unit tests), and one based
on parallel arrays, which allows for very fast sequential read-only access, but slower random
read-only access, and very much slower mutation speed.  I wrote implementations like this
for the Mahout Vector interface, and named them almost exactly like that: RandomAccessSparseVector,
and SequentialAccessSparseVector (although I should have called the latter something like
ReadOnlySequentialAccessSparseVector, but it would have been a lie, as you *can* modify it,
you just shouldn't).

These numbers now make a lot of sense - Objects Are Big, that's the moral of the story.  Any
place you need to hang onto gazillions of things in memory on the JVM, you should do your
best to stuff them into primitive arrays.  

> Introduce new primitive-specific MutableVertex subclasses
> ---------------------------------------------------------
>
>                 Key: GIRAPH-28
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-28
>             Project: Giraph
>          Issue Type: New Feature
>          Components: graph
>    Affects Versions: 0.70.0
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>         Attachments: GIRAPH-28.diff
>
>
> As discussed on the list, MutableVertex<LongWritable,DoubleWritable,FloatWritable,DoubleWritable>
(for example) could be highly optimized in its memory footprint if the vertex and edge data
were held in a form which minimized Java object usage.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message