Date: Thu, 19 Apr 2012 19:10:40 +0000 (UTC)
From: "Paolo Castagna (Commented) (JIRA)"
To: giraph-dev@incubator.apache.org
Message-ID: <104148896.7024.1334862640492.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <338333542.18708.1333651107097.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

[
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257695#comment-13257695 ]

Paolo Castagna commented on GIRAPH-170:
---------------------------------------

Hi Benjamin,

> I call this the RDFAdjacencyCSV

We came to the same conclusion. I ended up using Turtle for this, as explained here:
http://mail-archives.apache.org/mod_mbox/incubator-giraph-user/201204.mbox/%3C4F84872E.4050101%40googlemail.com%3E

Turtle isn't splittable in general, but it can be made so simply by writing all the RDF statements with the same subject on a single line.

> I would like to say that Paolo's suggestion of providing some ready made code for Pig, HBase and MapReduce for processing RDF sounds like a really great contribution.

I am not sure what the best place to put such code is. I started by sharing small examples and experiments on GitHub, here:
https://github.com/castagna/jena-grande

> Integration of RDF reasoning capabilities: I will need to perform subclass reasoning on the DBPedia graph.

See Apache Jena's RIOT infer command, or a MapReduce version of it, here:
https://github.com/castagna/tdbloader4/blob/master/src/main/java/org/apache/jena/tdbloader4/InferDriver.java

I wonder if Giraph could be used to implement the RETE algorithm (http://en.wikipedia.org/wiki/Rete_algorithm), which is what Jena uses (with in-memory RDF Jena models).

> Workflow for loading RDF graph data into Giraph
> -----------------------------------------------
>
> Key: GIRAPH-170
> URL: https://issues.apache.org/jira/browse/GIRAPH-170
> Project: Giraph
> Issue Type: New Feature
> Reporter: Dan Brickley
> Priority: Minor
>
> W3C RDF provides a family of Web standards for exchanging graph-based data. RDF uses sets of simple binary relationships, labeling nodes and links with Web identifiers (URIs). Many public datasets are available as RDF, including the "Linked Data" cloud (see http://richard.cyganiak.de/2007/10/lod/ ). Many such datasets are listed at http://thedatahub.org/
>
> RDF has several standard exchange syntaxes. The oldest is RDF/XML. A simple line-oriented format is N-Triples. A format aligned with RDF's SPARQL query language is Turtle. Apache Jena and Any23 provide software to handle all of these: http://incubator.apache.org/jena/ http://incubator.apache.org/any23/
>
> This JIRA leaves open the strategy for loading RDF data into Giraph. There are various possibilities, including exploitation of intermediate Hadoop-friendly stores, pre-processing with e.g. Pig-based tools into a more Giraph-friendly form, or writing custom loaders. Even a HOWTO document or implementor notes here would be an advance on the current state of the art. The BluePrints Graph API (Gremlin etc.) has also been aligned with various RDF datasources.
>
> Related topics: multigraphs https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue (we can't currently easily represent fully general RDF graphs, since two nodes might be connected by more than one typed edge). Even without multigraphs it ought to be possible to bring RDF-sourced data into Giraph, e.g. perhaps some app is only interested in, say, the Movies + People subset of a big RDF collection.
>
> From Avery in email: "a helper VertexInputFormat (and maybe VertexOutputFormat) would certainly [despite GIRAPH-141] still help"

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
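
To illustrate the point in the comment above about making Turtle splittable by writing all statements with the same subject on a single line, here is a minimal sketch in Python. It is not code from jena-grande or Giraph: the function name `triples_to_subject_lines` and the example.org URIs are made up for illustration. It assumes each term is already serialized as a valid Turtle term (URIs in angle brackets, literals quoted with any newlines escaped), and it groups everything in memory, which would only work for small inputs; at scale the grouping would more likely be a MapReduce or Pig job keyed by subject.

```python
from collections import OrderedDict


def triples_to_subject_lines(triples):
    """Yield one Turtle line per subject from (subject, predicate, object) tuples.

    Terms are assumed to be pre-serialized Turtle terms, so each yielded
    line is a self-contained record that a line-oriented input format
    can split on.
    """
    by_subject = OrderedDict()
    for s, p, o in triples:
        # subject -> predicate -> list of objects, keeping first-seen order
        by_subject.setdefault(s, OrderedDict()).setdefault(p, []).append(o)

    for s, predicates in by_subject.items():
        # ',' separates objects sharing a predicate; ';' separates predicates
        parts = [f"{p} {' , '.join(objs)}" for p, objs in predicates.items()]
        yield f"{s} {' ; '.join(parts)} ."


if __name__ == "__main__":
    example = [
        ("<http://example.org/a>", "<http://example.org/knows>", "<http://example.org/b>"),
        ("<http://example.org/a>", "<http://example.org/knows>", "<http://example.org/c>"),
        ("<http://example.org/a>", "<http://example.org/name>", '"Alice"'),
        ("<http://example.org/b>", "<http://example.org/knows>", "<http://example.org/a>"),
    ]
    for line in triples_to_subject_lines(example):
        print(line)
```

Each output line then carries a vertex (the subject) together with all of its outgoing typed edges, which is roughly the shape an adjacency-list style input format would need.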