incubator-giraph-user mailing list archives

From Avery Ching
Subject Re: Calling BspUtils.createVertexResolver from a TextVertexReader ?
Date Wed, 14 Mar 2012 06:39:27 GMT
Thanks for your input.  Response inline.


On 3/13/12 7:14 AM, Dionysios Logothetis wrote:
> Hi all,
> I'm a new Giraph user, and I'm facing a similar situation. My input 
> graph is basically in the form of edges defined simply as a source and 
> destination pair (optionally there could be an edge value). And these 
> edges might be distributed across multiple files (this is actually a 
> format I've seen in several graph data sets).
> Without having looked at the internals of Giraph, I originally 
> imagined that creating a MutableVertex and calling addVertexRequest 
> for both vertices in an edge and addEdgeRequest from within the 
> VertexReader would do the trick.
I agree that this idea can work. We would also need a default vertex 
value in case folks only add edges for a vertex index and never define 
the vertex itself.
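
To make the default-value point concrete, here is a minimal sketch in plain Java. It is not Giraph code: SketchVertex, EdgeListLoader, and the two default constants are hypothetical names, and it assumes whitespace-separated "source destination [value]" lines read into memory on a single machine. Every vertex id that shows up on either end of an edge gets a vertex, and ids that never receive an explicit value keep the default.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a vertex; not a Giraph class. */
final class SketchVertex {
    final String id;
    final double value;
    /** Target vertex id -> edge value. */
    final Map<String, Double> edges = new LinkedHashMap<>();

    SketchVertex(String id, double value) {
        this.id = id;
        this.value = value;
    }
}

final class EdgeListLoader {
    // Both defaults are assumptions made for this sketch.
    static final double DEFAULT_VERTEX_VALUE = 0.0;
    static final double DEFAULT_EDGE_VALUE = 1.0;

    /** Builds vertices from "source destination [value]" lines. */
    static Map<String, SketchVertex> load(List<String> lines) {
        Map<String, SketchVertex> vertices = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length < 2) {
                throw new IllegalArgumentException(
                        "Expected 'source destination [value]': " + line);
            }
            double edgeValue = parts.length > 2
                    ? Double.parseDouble(parts[2])
                    : DEFAULT_EDGE_VALUE;
            // Create both endpoints on first sight, with the default value.
            SketchVertex source = vertices.computeIfAbsent(
                    parts[0], id -> new SketchVertex(id, DEFAULT_VERTEX_VALUE));
            vertices.computeIfAbsent(
                    parts[1], id -> new SketchVertex(id, DEFAULT_VERTEX_VALUE));
            source.edges.put(parts[1], edgeValue);
        }
        return vertices;
    }
}
```

Calling EdgeListLoader.load on the lines "a b 2.5", "a c", and "c a" yields three vertices; "b" appears only as a destination, so it ends up with the default value and no outgoing edges.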

> Now, this doesn't really work since there needs to be a graph state 
> created in advance. The graph state is not created until all vertices 
> have been loaded.
I wouldn't worry about graph state here since it's the input superstep.  
We can set it for all vertices after creation if need be.
> There's also another implication with potentially multiple workers 
> trying to create the same vertex, but I think a vertex resolver can 
> handle this, assuming the resolver is instantiated before the vertices 
> are loaded.
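
As a rough illustration of how such conflicts might be merged, here is a sketch in plain Java. It does not use Giraph's vertex resolver API: PartialVertex and MergeSketch are hypothetical names, and the merge policy (first explicit value wins, edge sets are unioned, a later duplicate edge overwrites an earlier one) is only an assumption for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical partial vertex as one worker might load it; not a Giraph class. */
final class PartialVertex {
    final String id;
    /** null means no explicit vertex value was seen by that worker. */
    final Double value;
    /** Target vertex id -> edge value. */
    final Map<String, Double> edges = new TreeMap<>();

    PartialVertex(String id, Double value) {
        this.id = id;
        this.value = value;
    }

    PartialVertex edge(String target, double edgeValue) {
        edges.put(target, edgeValue);
        return this;
    }
}

final class MergeSketch {
    /** Merges every partial copy of one vertex id into a single vertex. */
    static PartialVertex resolve(String id, List<PartialVertex> partials,
                                 double defaultValue) {
        Double value = null;
        for (PartialVertex p : partials) {
            if (!p.id.equals(id)) {
                throw new IllegalArgumentException("Unexpected vertex id: " + p.id);
            }
            if (value == null) {
                value = p.value; // assumed policy: first explicit value wins
            }
        }
        PartialVertex merged =
                new PartialVertex(id, value != null ? value : defaultValue);
        for (PartialVertex p : partials) {
            merged.edges.putAll(p.edges); // union; later duplicates overwrite
        }
        return merged;
    }
}
```

If one worker saw vertex "a" only as an edge source and another saw it with an explicit value, resolve returns a single vertex carrying that value and the edges from both copies.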

> Is there a workaround to do this currently apart from pre-processing 
> the graph?

Not currently.  Can you please open a JIRA to track this issue?  I 
think we should do it.

> Do you think it would be useful to have such functionality?


> I think it makes sense to handle graph mutations either at the very 
> beginning or during execution in a uniform way. By the way, I'd be 
> interested in contributing to the project.

We'd love to have your contributions; it's a great fit. =)
> Looking forward to your response!
> Thanks!
> On Mon, Mar 12, 2012 at 9:09 PM, Avery Ching wrote:
>     Benjamin,
>     By the way, you're not the first to ask for a feature of this
>     kind.  Perhaps we should consider an alternative format for
>     loading input vertex data that is based on the edges or data of
>     the vertices rather than totally vertex-centric.  We could load an
>     edge or a vertex value and join them all based on the vertex id.
>      Handling conflicts could be a little difficult, but perhaps the
>     vertex resolver could handle this as well.
>     Avery
>     On 3/12/12 12:41 PM, Benjamin Heitmann wrote:
>         On 12 Mar 2012, at 18:15, David Garcia wrote:
>             Not sure what you're asking about.  getCurrentVertex()
>             should only ever
>             create one vertex.  Presumably it returns this vertex to
>             the calling
>             function... which is called in loadVertices(), I think.
>         Thanks David.
>         I am asking this question because I have a text input format
>         which is very different from a node adjacency list.
>         The most important difference is that each line of the input
>         file describes two nodes.
>         The other important difference is that a node might be
>         described on more than one line of the input.
>         I have multiple gigabytes of input, so it would be very
>         beneficial to directly load the input into Giraph.
>         Otherwise the overhead of converting the input to some sort of
>         node adjacency list is so big,
>         that it might be a show-stopper regarding the suitability of
>         Giraph.
>         For more details, here is the text from my previous email:  
>         =========================[snip]===========
>         I am wondering if it would be possible to parse RDF input
>         files from a TextInputFormat class.
>         The most suitable text format for RDF is called "NTriples",
>         and it has this very simple format:
>         subject1 predicate1 object1 .\n
>         subject1 predicate2 object2 .\n
>         ...
>         So each line contains the subject, which is a vertex, a
>         predicate, which is a typed edge, and the object, which is
>         another vertex.
>         Then the line is terminated by a dot and a new-line.
>         In Giraph terms, the result of parsing the first line would be
>         the creation of a vertex for subject1 with an edge of type
>         predicate1,
>         and then the creation of a second vertex for object1. So two
>         vertices need to be created for that one line.
>         Now the second line contains more information about the vertex
>         subject1.
>         So in Giraph terms, the vertex which was created for subject1
>         needs to be retrieved/revisited and an edge of type predicate2,
>         which points to the new vertex object2 needs to be created.
>         And vertex object2 needs to be created.
>         Just to point it out, such RDF NTriples files are unsorted, so
>         information about the same vertex might appear e.g. at the
>         first and at the last line
>         of a multi-gigabyte file.
>         Which interface can be used in a TextInputFormat/VertexReader
>         in order to find an already created vertex ?
>         Are there any other issues when
>         VertexReader.getCurrentVertex() creates two vertices at the
>         same time ?
>         A second related question:
>         If I have multiple formats for my input files, how would I
>         implement that ?
>         Just by adding a switch to the logic in getCurrentVertex() ?
>         Or is there a better way to switch the input logic based on
>         the file type ?
>         All my input files would result in the same kind of Vertex
>         being created.
>         My motivation for doing this, in short:
>         I have a large amount of RDF NTriples data which is provided
>         by DBPedia. It amounts to somewhere between 5 GB and 20 GB,
>         depending on which subset is used. Expressing this RDF data,
>         so that each vertex is completely described in one text line,
>         would require me to load it into an RDF store first, and then
>         reprocess the data. In terms of RDF stores, that is already a
>         non-trivial amount of data
>         requiring quite a bit of hardware and tweaking. That is the
>         reason why it would be valuable to directly load the RDF data
>         into Giraph.
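
The parsing step described above can be sketched in plain Java as follows. This is not a Giraph input format: TypedEdge and NTriplesSketch are hypothetical names, everything is held in memory on one machine, and only the simplified "subject predicate object ." form from the email is handled (real N-Triples files also contain angle-bracketed IRIs, quoted literals with escapes, blank nodes, and comments). Because vertices are looked up by id in a map, lines about the same subject can appear anywhere in the input.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical typed edge: the predicate is the edge type. */
final class TypedEdge {
    final String predicate;
    final String target;

    TypedEdge(String predicate, String target) {
        this.predicate = predicate;
        this.target = target;
    }
}

final class NTriplesSketch {
    /** Splits one simplified line into {subject, predicate, object}. */
    static String[] parseLine(String line) {
        String s = line.trim();
        if (!s.endsWith(".")) {
            throw new IllegalArgumentException("Missing terminating dot: " + line);
        }
        s = s.substring(0, s.length() - 1).trim();
        // Limit of 3 keeps any whitespace inside the object in one piece.
        String[] parts = s.split("\\s+", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "Expected 'subject predicate object .': " + line);
        }
        return parts;
    }

    /** Builds vertex id -> outgoing typed edges from unsorted triple lines. */
    static Map<String, List<TypedEdge>> load(Iterable<String> lines) {
        Map<String, List<TypedEdge>> graph = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] t = parseLine(line);
            // Revisit the subject's vertex if it exists, else create it.
            graph.computeIfAbsent(t[0], k -> new ArrayList<>())
                 .add(new TypedEdge(t[1], t[2]));
            // The object is a vertex too, even if it never appears as a subject.
            graph.computeIfAbsent(t[2], k -> new ArrayList<>());
        }
        return graph;
    }
}
```

The sketch only shows the join-by-id logic. It does not address the distributed case, where different workers would each see part of a vertex and the partial results would still have to be combined.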
