incubator-giraph-dev mailing list archives

From "Benjamin Heitmann (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph
Date Thu, 19 Apr 2012 11:04:42 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257424#comment-13257424 ]

Benjamin Heitmann commented on GIRAPH-170:
------------------------------------------

Hello,
and sorry for contributing to this discussion so late.

I am currently using Giraph to implement a graph-based recommendation algorithm which uses
RDF data from DBPedia. I am not sure if that is enough of a use case for Paolo.

Generally speaking, statistical analysis of semantic networks should be the most general motivation
for using Giraph on RDF. In other words: since RDF has a native graph data model and RDF
processing needs to happen at web scale, Giraph could be a natural fit for processing RDF,
if it supported RDF input/ingestion natively.

Regarding the fundamental capabilities required for parsing N-Triples files: the TextInputFormat
needs a way to retrieve and alter already created nodes. Currently the assumption in the
TextInputFormat class is that it gets exactly one line for each vertex to create, and that
this one line holds *all* information necessary to create the vertex.
However, the N-Triples format does not work that way, as it can use multiple lines to describe
the same subject node.
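
To illustrate the mismatch, here is a minimal sketch in plain Python (not Giraph code). The
sample triples and the helper names `parse_ntriple` and `group_by_subject` are made up for
illustration, and the parser is deliberately simplified: it assumes that subject and predicate
contain no whitespace and that each statement ends with " .". It shows that one subject can
appear on several lines, so a reader that creates one vertex per line would see the same
vertex more than once.

```python
from collections import defaultdict

# Hypothetical sample data, not taken from DBPedia.
SAMPLE = """\
<http://example.org/A> <http://example.org/knows> <http://example.org/B> .
<http://example.org/B> <http://example.org/knows> <http://example.org/C> .
<http://example.org/A> <http://example.org/label> "Node A" .
"""


def parse_ntriple(line):
    """Split one N-Triples line into (subject, predicate, object).

    Simplified: no unescaping, no validation of URIs or literals.
    """
    subject, predicate, rest = line.strip().split(None, 2)
    obj = rest.rstrip()
    if obj.endswith("."):
        # Drop the statement terminator and the whitespace before it.
        obj = obj[:-1].rstrip()
    return subject, predicate, obj


def group_by_subject(lines):
    """Collect all (predicate, object) pairs that share a subject."""
    edges = defaultdict(list)
    for line in lines:
        # Skip blank lines and N-Triples comment lines.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        subject, predicate, obj = parse_ntriple(line)
        edges[subject].append((predicate, obj))
    return dict(edges)


grouped = group_by_subject(SAMPLE.splitlines())
for subject, pairs in grouped.items():
    print(subject, len(pairs))
```

In this sample, `<http://example.org/A>` is described by the first and the third line, which is
exactly the case a one-line-per-vertex input format cannot handle without either a lookup of
existing vertices or a preprocessing step.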

I already raised this issue on the user mailing list. (However, I did not create a Jira issue
for it.) This is the fundamental capability which is lacking in Giraph. If this is enabled,
parsing N-Triples will be easy. The starting points for the email threads in which this was
briefly discussed are in [1] and [2].

AFAIR, Dionysis Logothetis suggested that he may look into adding this capability to Giraph.
So you might want to contact him directly to check on the progress.

Now a few details on how I use RDF data for my Giraph job: 
Currently I use a subset of DBPedia, which is roughly 5.5GB unpacked. 
As this DBPedia subset stays static for all my recommendations, it is enough to preprocess
it once using a quite simple MapReduce job. I basically join all lines on the subject of the
triple, and then output the following line for each subject:
SubjectURI NumberOfOutLinks Predicate1 Object1 ... PredicateN ObjectN
(I call this the RDFAdjacencyCSV ;) 
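
A minimal sketch of that join in plain Python, assuming the triples are already parsed into
(subject, predicate, object) tuples. The function name `to_adjacency_lines`, the space
delimiter, and the `ex:` sample identifiers are assumptions for illustration; this is not the
actual MapReduce job. In MapReduce terms, the grouping step would presumably be a mapper that
emits the subject as key and a reducer that writes one line per key.

```python
from collections import defaultdict


def to_adjacency_lines(triples, sep=" "):
    """Join triples on the subject and emit one line per subject.

    Line layout: SubjectURI NumberOfOutLinks Predicate1 Object1 ... PredicateN ObjectN
    Note: with a space delimiter, literals containing whitespace would need
    escaping or a different separator; that case is not handled here.
    """
    by_subject = defaultdict(list)
    for subject, predicate, obj in triples:
        by_subject[subject].append((predicate, obj))

    lines = []
    for subject in sorted(by_subject):
        pairs = by_subject[subject]
        fields = [subject, str(len(pairs))]
        for predicate, obj in pairs:
            fields.extend([predicate, obj])
        lines.append(sep.join(fields))
    return lines


# Hypothetical sample triples.
triples = [
    ("ex:A", "ex:knows", "ex:B"),
    ("ex:B", "ex:knows", "ex:C"),
    ("ex:A", "ex:likes", "ex:C"),
]
lines = to_adjacency_lines(triples)
for line in lines:
    print(line)
```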

For my specific algorithm, the direction of the link in the RDF graph does not play any role,
so for each input triple, I add it once to the subject entity and once to the object entity.
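
Ignoring edge direction in this way can be sketched as follows, again in plain Python with
made-up sample data and a hypothetical function name `undirected_adjacency`. The sketch treats
every object as a node; how literal objects are handled is not specified here, so that part
is an assumption.

```python
from collections import defaultdict


def undirected_adjacency(triples):
    """Record each triple on both of its endpoints."""
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        adjacency[subject].append((predicate, obj))   # subject -> object
        adjacency[obj].append((predicate, subject))   # object -> subject
    return dict(adjacency)


# Hypothetical sample triples.
triples = [
    ("ex:A", "ex:knows", "ex:B"),
    ("ex:B", "ex:knows", "ex:C"),
]
adj = undirected_adjacency(triples)
```

Each triple contributes two entries, so the total number of stored links is twice the number
of input triples.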


The processing job took two days, but it was my first Hadoop program, so it probably was
inefficient.
The output size was 6GB.

For running my algorithm, my Giraph job first loads the complete DBPedia dataset into memory.
While doing this it also loads the user profiles via DistributedCache.getLocalCacheFiles(conf).
This is done in my own custom TextVertexInputFormat class. The profiles are used to prime
the graph, i.e. to identify the starting points for the algorithm. I also need to manage which
starting points belong to which user profiles.
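
The two steps can be sketched in plain Python. This is not the custom TextVertexInputFormat
itself: the function names, the whitespace-delimited line layout, and the profile structure
(a mapping from user id to a list of entity URIs) are all assumptions for illustration.

```python
def parse_adjacency_line(line):
    """Parse 'SubjectURI N Predicate1 Object1 ... PredicateN ObjectN'.

    Assumes whitespace-separated fields with no whitespace inside a field.
    """
    fields = line.split()
    vertex_id, count = fields[0], int(fields[1])
    rest = fields[2:]
    if len(rest) != 2 * count:
        raise ValueError("expected %d predicate/object pairs in: %r" % (count, line))
    edges = [(rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]
    return vertex_id, edges


def starting_points(profiles):
    """Invert user profiles into a map from start vertex to its users."""
    owners = {}
    for user, uris in profiles.items():
        for uri in uris:
            owners.setdefault(uri, set()).add(user)
    return owners


# Hypothetical input line and profiles.
vertex_id, edges = parse_adjacency_line("ex:A 2 ex:knows ex:B ex:likes ex:C")
owners = starting_points({"u1": ["ex:A", "ex:B"], "u2": ["ex:A"]})

# A vertex is a starting point if at least one profile refers to it.
is_start = vertex_id in owners
```

Keeping the inverted map next to the vertex data makes it possible to tell, for every
starting vertex, which user profiles it belongs to.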

Challenges which I will have in the near future: 
* Giraph does not seem to scale very well for my kind of data and processing: regardless
of the number of workers, my Giraph job only uses about 30% of a 24-node machine, and I would
like to utilise all available processing resources.
* Integration of RDF reasoning capabilities: I will need to perform subclass reasoning on
the DBPedia graph. The most pragmatic solution seems to be to have an external RDF store
with reasoning, and to let the Giraph workers query that RDF store.


 
[1] https://mail-archives.apache.org/mod_mbox/incubator-giraph-user/201203.mbox/%3CE5D0BE74-7903-4145-BE10-52CBD6489AC8%40deri.org%3E
[2] https://mail-archives.apache.org/mod_mbox/incubator-giraph-user/201203.mbox/%3CC6DA4465-B387-474A-B823-84019967DA3E%40deri.org%3E
                
> Workflow for loading RDF graph data into Giraph
> -----------------------------------------------
>
>                 Key: GIRAPH-170
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-170
>             Project: Giraph
>          Issue Type: New Feature
>            Reporter: Dan Brickley
>            Priority: Minor
>
> W3C RDF provides a family of Web standards for exchanging graph-based data. RDF uses
sets of simple binary relationships, labeling nodes and links with Web identifiers (URIs).
Many public datasets are available as RDF, including the "Linked Data" cloud (see http://richard.cyganiak.de/2007/10/lod/
). Many such datasets are listed at http://thedatahub.org/
> RDF has several standard exchange syntaxes. The oldest is RDF/XML. A simple line-oriented
format is N-Triples. A format aligned with RDF's SPARQL query language is Turtle. Apache Jena
and Any23 provide software to handle all these; http://incubator.apache.org/jena/ http://incubator.apache.org/any23/
> This JIRA leaves open the strategy for loading RDF data into Giraph. There are various
possibilities, including exploitation of intermediate Hadoop-friendly stores, or pre-processing
with e.g. Pig-based tools into a more Giraph-friendly form, or writing custom loaders. Even
a HOWTO document or implementor notes here would be an advance on the current state of the
art. The BluePrints Graph API (Gremlin etc.) has also been aligned with various RDF datasources.
> Related topics: multigraphs https://issues.apache.org/jira/browse/GIRAPH-141 touches
on the issue (since we can't currently easily represent fully general RDF graphs since two
nodes might be connected by more than one typed edge). Even without multigraphs it ought to
be possible to bring RDF-sourced data
> into Giraph, e.g. perhaps some app is only interested in say the Movies + People subset
of a big RDF collection.
> From Avery in email: "a helper VertexInputFormat (and maybe VertexOutputFormat) would
certainly [despite GIRAPH-141] still help"

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
