incubator-giraph-user mailing list archives

From Avery Ching <ach...@apache.org>
Subject Re: Calling BspUtils.createVertexResolver from a TextVertexReader ?
Date Fri, 16 Mar 2012 07:07:44 GMT
If you found it useful, others might find it useful as well.  Please 
feel free to add to a JIRA.

Avery

On 3/15/12 4:44 AM, Dionysis Logothetis wrote:
> Ok, I've created an issue: 
> https://issues.apache.org/jira/browse/GIRAPH-155
> Feel free to edit if you think the description is not clear.
>
>
> By the way, I have also created a vertex reader that reads adjacency 
> lists but with no values for vertices and edges. That's also a format 
> that I've seen in several graph data sets. The vertex reader is 
> essentially a copy of the AdjacencyListVertexReader modified to handle 
> this format. It's basically an abstract class and subclasses can 
> override methods to provide default values for vertices and edges 
> (otherwise values are initialized to null), just like Avery described 
> below. If you think it's useful I can contribute this.
>
>
> On Wed, Mar 14, 2012 at 7:39 AM, Avery Ching <aching@apache.org 
> <mailto:aching@apache.org>> wrote:
>
>     Thanks for your input.  Response inline.
>
>     Avery
>
>
>     On 3/13/12 7:14 AM, Dionysios Logothetis wrote:
>>     Hi all,
>>     I'm a new Giraph user, and I'm facing a similar situation. My
>>     input graph is basically in the form of edges defined simply as a
>>     source and destination pair (optionally there could be an edge
>>     value). And these edges might be distributed across multiple
>>     files (this is actually a format I've seen in several graph data
>>     sets).
>>
>>     Without having looked at the internals of Giraph, I originally
>>     imagined that creating a MutableVertex and calling
>>     addVertexRequest for both vertices in an edge and addEdgeRequest
>>     from within the VertexReader would do the trick.
>>
>     I agree that this idea can work. We would also need a default
>     vertex value in case folks add edges to a vertex index only.
>
>
>>     Now, this doesn't really work since there needs to be a graph
>>     state created in advance. The graph state is not created until
>>     all vertices have been loaded.
>     I wouldn't worry about graph state here since it's the input
>     superstep.  We can set it for all vertices after creation if need be.
>
>>
>>     There's also another implication with
>>     potentially multiple workers trying to create the same vertex,
>>     but I think a vertex resolver can handle this, assuming the
>>     resolver is instantiated before the vertices are loaded.
>>
>     Yup.
>
>
>>     Is there a workaround to do this currently apart from
>>     pre-processing the graph?
>
>     Not currently.  Can you please open a JIRA on
>     https://issues.apache.org/jira/browse/GIRAPH to track this
>     issue?  I think we should do it.
>
>
>>     Do you think it would be useful to have such functionality?
>
>     Yes!
>
>
>>     I think it makes sense to handle graph mutations either at the
>>     very beginning or during an execution in a uniform way. By the
>>     way, I'd be interested in contributing to the project.
>
>     We'd love to have your contributions, it's a great fit. =)
>
>>
>>     Looking forward to your response!
>>
>>     Thanks!
>>
>>
>>     On Mon, Mar 12, 2012 at 9:09 PM, Avery Ching <aching@apache.org
>>     <mailto:aching@apache.org>> wrote:
>>
>>         Benjamin,
>>
>>         By the way, you're not the first to ask for a feature of this
>>         kind.  Perhaps we should consider an alternative format for
>>         loading input vertex data that is based on the edges or data
>>         of the vertices rather than totally vertex-centric.  We could
>>         load an edge, or a vertex value and join them all based on
>>         the vertex id.  Handling conflicts could be a little
>>         difficult, but perhaps the vertex resolver could handle this
>>         as well.
>>
>>         Avery
>>
>>
>>         On 3/12/12 12:41 PM, Benjamin Heitmann wrote:
>>
>>             On 12 Mar 2012, at 18:15, David Garcia wrote:
>>
>>                 Not sure what you're asking about.
>>                 getCurrentVertex() should only ever create one vertex.
>>                 Presumably it returns this vertex to the calling
>>                 function... which is called in loadVertices() I think.
>>
>>             Thanks David.
>>
>>             I am asking this question because I have a text input
>>             format which is very different from a node adjacency list.
>>             The most important difference is that each line of the
>>             input file describes two nodes.
>>             The other important difference is that a node might be
>>             described on more than one line of the input.
>>
>>             I have multiple gigabytes of input, so it would be very
>>             beneficial to directly load the input into Giraph.
>>             Otherwise the overhead of converting the input to some
>>             sort of node adjacency list is so big,
>>             that it might be a show-stopper regarding the suitability
>>             of Giraph.
>>
>>
>>
>>
>>
>>
>>
>>             For more details, here is the text from my previous email:
>>             =========================[snip]===========
>>
>>             I am wondering if it would be possible to parse RDF input
>>             files from a TextInputFormat class.
>>
>>             The most suitable text format for RDF is called
>>             "NTriples", and it has this very simple format:
>>
>>             subject1 predicate1 object1 .\n
>>             subject1 predicate2 object2 .\n
>>             ...
>>
>>             So each line contains the subject, which is a vertex, a
>>             predicate, which is a typed edge, and the object, which
>>             is another vertex.
>>             Then the line is terminated by a dot and a new-line.
>>
>>             In Giraph terms, the result of parsing the first line
>>             would be the creation of a vertex for subject1 with an
>>             edge of type predicate1,
>>             and then the creation of a second vertex for object1. So
>>             two vertices need to be created for that one line.
>>
>>             Now the second line contains more information about the
>>             vertex subject1.
>>             So in Giraph terms, the vertex which was created for
>>             subject1 needs to be retrieved/revisited and an edge of
>>             type predicate2,
>>             which points to the new vertex object2 needs to be
>>             created. And vertex object2 needs to be created.
>>
>>             Just to point it out, such RDF NTriples files are
>>             unsorted, so information about the same vertex might
>>             appear e.g. at the first and at the last line
>>             of a multi-gigabyte file.
>>
>>             Which interface can be used in a
>>             TextInputFormat/VertexReader in order to find an already
>>             created vertex ?
>>
>>             Are there any other issues when
>>             VertexReader.getCurrentVertex() creates two vertices at
>>             the same time ?
>>
>>
>>             A second related question:
>>             If I have multiple formats for my input files, how would
>>             I implement that ?
>>             Just by adding a switch to the logic in
>>             getCurrentVertex() ? Or is there a better way to switch
>>             the input logic based on the file type ?
>>             All my input files would result in the same kind of
>>             Vertex being created.
>>
>>
>>             My motivation for doing this, in short:
>>             I have a large amount of RDF NTriples data which is
>>             provided by DBPedia. It amounts to somewhere between 5 GB
>>             and 20 GB,
>>             depending on which subset is used. Expressing this RDF
>>             data, so that each vertex is completely described in one
>>             text line,
>>             would require me to load it into an RDF store first, and
>>             then reprocess the data. In terms of RDF stores, that is
>>             already a non-trivial amount of data
>>             requiring quite a bit of hardware and tweaking. That is
>>             the reason why it would be valuable to directly load the
>>             RDF data into Giraph.
>>
>>
>>
>>
>>
>>
>
>
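
The line layout described in the thread (subject, predicate, object, terminated by a dot) can be illustrated with a small standalone sketch. It is not Giraph code and not a complete N-Triples parser: `Triple` and `NTriplesLineParser` are hypothetical names introduced here, and the sketch assumes the subject and predicate contain no whitespace, so everything after them up to the final dot is treated as the object.

```java
import java.util.Optional;

/** Hypothetical holder for one parsed line; not a Giraph type. */
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

/** Simplified line splitter, assuming whitespace-free subject and predicate. */
final class NTriplesLineParser {

    /** Returns the triple on this line, or empty for blank, comment, or malformed lines. */
    static Optional<Triple> parse(String line) {
        String text = line.trim();
        if (text.isEmpty() || text.startsWith("#") || !text.endsWith(".")) {
            return Optional.empty();
        }
        // Drop the terminating dot, then split into at most three fields so that
        // an object containing spaces (for example a quoted literal) stays whole.
        String body = text.substring(0, text.length() - 1).trim();
        String[] parts = body.split("\\s+", 3);
        if (parts.length < 3) {
            return Optional.empty();
        }
        return Optional.of(new Triple(parts[0], parts[1], parts[2]));
    }

    public static void main(String[] args) {
        parse("subject1 predicate1 object1 .")
            .ifPresent(t -> System.out.println(
                t.subject + " -[" + t.predicate + "]-> " + t.object));
    }
}
```

Each parsed line yields one edge and two vertex ids, which is why a reader that emits exactly one fully described vertex per record does not fit this input directly.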

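The idea discussed above of loading edges and joining them on the vertex id, with a default for vertices that only ever appear as an edge target, might look roughly like the following in-memory sketch. It is a single-machine illustration only: `EdgeListJoiner` is a hypothetical name, the empty edge list stands in for a "default vertex value", and nothing here uses Giraph's vertex resolver or addresses how the join would be distributed across workers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of joining unsorted edges by vertex id. */
final class EdgeListJoiner {

    /**
     * Builds an adjacency list from (source, target) pairs given in any order.
     * Every id seen as a source or a target gets an entry, so vertices that
     * only appear as targets still exist, with an empty out-edge list.
     */
    static Map<String, List<String>> join(List<String[]> edges) {
        Map<String, List<String>> adjacency = new TreeMap<>();
        for (String[] edge : edges) {
            String source = edge[0];
            String target = edge[1];
            // Create the source vertex on first sight, then append the edge.
            adjacency.computeIfAbsent(source, id -> new ArrayList<>()).add(target);
            // Make sure the target vertex exists even if it has no out-edges.
            adjacency.computeIfAbsent(target, id -> new ArrayList<>());
        }
        return adjacency;
    }

    public static void main(String[] args) {
        List<String[]> edges = Arrays.asList(
            new String[] {"a", "b"},
            new String[] {"c", "a"},
            new String[] {"a", "c"});
        System.out.println(join(edges)); // {a=[b, c], b=[], c=[a]}
    }
}
```

The edges for vertex `a` arrive on non-adjacent records and are still merged into one entry, which mirrors the unsorted-input case raised in the thread.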
