giraph-user mailing list archives

From Avery Ching
Subject Re: Calling BspUtils.createVertexResolver from a TextVertexReader ?
Date Mon, 12 Mar 2012 20:09:20 GMT

By the way, you're not the first to ask for a feature of this kind.
Perhaps we should consider an alternative format for loading input
vertex data that is based on the edges or data of the vertices rather
than being totally vertex-centric.  We could load an edge or a vertex
value and join them all based on the vertex id.  Handling conflicts could
be a little difficult, but perhaps the vertex resolver could handle this
as well.
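
To make the "load edges and join on the vertex id" idea concrete, here is a
minimal single-machine sketch in plain Java. It is not the Giraph API: the
class EdgeJoinSketch, the method joinByVertexId, and the {sourceId, edgeLabel,
targetId} record layout are all hypothetical, and the conflict policy shown
(a repeated edge is kept only once) is just one assumed choice that a vertex
resolver might make.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain Java, not the Giraph API.
final class EdgeJoinSketch {

    // Joins edge records of the form {sourceId, edgeLabel, targetId} on the
    // vertex id. Returns vertexId -> list of "edgeLabel->targetId" strings.
    static Map<String, List<String>> joinByVertexId(List<String[]> edgeRecords) {
        Map<String, List<String>> adjacency = new LinkedHashMap<>();
        for (String[] record : edgeRecords) {
            String source = record[0];
            String target = record[2];
            String outEdge = record[1] + "->" + target;
            // Reuse the vertex if an earlier record already created it.
            List<String> outEdges =
                adjacency.computeIfAbsent(source, id -> new ArrayList<>());
            // Assumed conflict policy: a repeated edge is kept only once.
            if (!outEdges.contains(outEdge)) {
                outEdges.add(outEdge);
            }
            // The target is a vertex too, even if it never appears as a source.
            adjacency.computeIfAbsent(target, id -> new ArrayList<>());
        }
        return adjacency;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"subject1", "predicate1", "object1"});
        records.add(new String[] {"subject1", "predicate2", "object2"});
        System.out.println(joinByVertexId(records));
    }
}
```

In a distributed setting the map would be replaced by routing each record to
the worker that owns the vertex id, but the join key and the need for a
conflict policy stay the same.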


On 3/12/12 12:41 PM, Benjamin Heitmann wrote:
> On 12 Mar 2012, at 18:15, David Garcia wrote:
>> Not sure what you're asking about.  getCurrentVertex() should only ever
>> create one vertex.  Presumably it returns this vertex to the calling
>> function... which is called in loadVertices(), I think.
> Thanks David.
> I am asking this question because I have a text input format which is very
> different from a node adjacency list.
> The most important difference is that each line of the input file describes
> two nodes.
> The other important difference is that a node might be described on more than
> one line of the input.
> I have multiple gigabytes of input, so it would be very beneficial to load the
> input directly into Giraph.
> Otherwise the overhead of converting the input to some sort of node adjacency
> list is so big that it might be a show-stopper regarding the suitability of
> Giraph.
> For more details, here is the text from my previous email:
> =========================[snip]===========
> I am wondering if it would be possible to parse RDF input files from a
> TextInputFormat.
> The most suitable text format for RDF is called "NTriples", and it has this
> very simple structure:
> subject1 predicate1 object1 .\n
> subject1 predicate2 object2 .\n
> ...
> So each line contains the subject, which is a vertex, a predicate, which is a
> typed edge, and the object, which is another vertex.
> Then the line is terminated by a dot and a newline.
> In Giraph terms, the result of parsing the first line would be the creation of
> a vertex for subject1 with an edge of type predicate1, and then the creation of
> a second vertex for object1. So two vertices need to be created for that one
> line.
> Now the second line contains more information about the vertex subject1.
> So in Giraph terms, the vertex which was created for subject1 needs to be
> retrieved and revisited, an edge of type predicate2 pointing to the new vertex
> object2 needs to be created, and vertex object2 itself needs to be created.
> Just to point it out, such RDF NTriples files are unsorted, so information
> about the same vertex might appear, e.g., on the first and on the last line of
> a multi-gigabyte file.
> Which interface can be used in a TextInputFormat/VertexReader in order to
> find an already created vertex?
> Are there any other issues when VertexReader.getCurrentVertex() creates two
> vertices at the same time?
> A second related question:
> If I have multiple formats for my input files, how would I implement that?
> Just by adding a switch to the logic in getCurrentVertex()? Or is there a
> better way to switch the input logic based on the file type?
> All my input files would result in the same kind of Vertex being created.
> My motivation for doing this, in short:
> I have a large amount of RDF NTriples data which is provided by DBPedia. It
> amounts to somewhere between 5 GB and 20 GB, depending on which subset is used.
> Expressing this RDF data so that each vertex is completely described in one
> text line would require me to load it into an RDF store first, and then
> reprocess the data. In terms of RDF stores, that is already a non-trivial
> amount of data, requiring quite a bit of hardware and tweaking. That is the
> reason why it would be valuable to load the RDF data directly into Giraph.
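
As a rough sketch of the per-line parsing described in the quoted message, the
plain-Java snippet below splits one "subject predicate object ." line into its
three parts, from which the two vertex ids (subject and object) and the edge
type (predicate) follow. The class NTripleLineSketch and the method parseLine
are hypothetical names, nothing here uses the Giraph API, and the parsing is
deliberately simplified: it does not implement the full N-Triples grammar (no
comment lines, no escape handling), and it only assumes that anything after
the second whitespace-separated token belongs to the object.

```java
// Hypothetical illustration only: simplified N-Triples line splitting,
// not a full N-Triples parser and not the Giraph API.
final class NTripleLineSketch {

    // Returns {subject, predicate, object} for one "s p o ." line.
    static String[] parseLine(String line) {
        String trimmed = line.trim();
        if (!trimmed.endsWith(".")) {
            throw new IllegalArgumentException("Missing terminating dot: " + line);
        }
        // Drop the terminating dot, then split into at most three parts so
        // that an object containing spaces (e.g. a quoted literal) stays whole.
        String body = trimmed.substring(0, trimmed.length() - 1).trim();
        String[] parts = body.split("\\s+", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected three terms: " + line);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] triple = parseLine("subject1 predicate1 object1 .");
        // One line yields two vertex ids and one typed edge between them.
        System.out.println("source vertex: " + triple[0]);
        System.out.println("edge type:     " + triple[1]);
        System.out.println("target vertex: " + triple[2]);
    }
}
```

Because each line is parsed independently, records for the same subject can
arrive in any order; merging them into one vertex is a separate step keyed on
the vertex id.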
