From: Benjamin Heitmann
To: giraph-user@incubator.apache.org
Subject: Question about TextInputFormat pattern for parsing e.g. RDF
Date: Thu, 8 Mar 2012 15:14:41 +0000

Hello again,

I am wondering if it would be possible to parse RDF input files from a TextInputFormat class.

The most suitable text format for RDF is called "NTriples", and it has this very simple format:

    subject1 predicate1 object1 .\n
    subject1 predicate2 object2 .\n
    ...

So each line contains the subject, which is a vertex, a predicate, which is a typed edge, and the object, which is another vertex. The line is then terminated by a dot and a newline.

In Giraph terms, the result of parsing the first line would be the creation of a vertex for subject1 with an edge of type predicate1, and then the creation of a second vertex for object1. So two vertices need to be created for that one line.

Now the second line contains more information about the vertex subject1. So in Giraph terms, the vertex which was created for subject1 needs to be retrieved/revisited, and an edge of type predicate2 which points to the new vertex object2 needs to be created. The vertex object2 needs to be created as well.

Just to point it out, such RDF NTriples files are unsorted, so information about the same vertex might appear e.g. on the first and on the last line of a multi-gigabyte file.

Which interface can be used in a TextInputFormat/VertexReader in order to find an already created vertex?
Are there any other issues when VertexReader.getCurrentVertex() creates two vertices at the same time?

A second, related question:
If I have multiple formats for my input files, how would I implement that?
Just by adding a switch to the logic in getCurrentVertex()?
Or is there a better way to switch the input logic based on the file type?
All my input files would result in the same kind of Vertex being created.

My motivation for doing this, in short:
I have a large amount of RDF NTriples data which is provided by DBPedia. It amounts to somewhere between 5 GB and 20 GB, depending on which subset is used. Expressing this RDF data so that each vertex is completely described in one text line would require me to load it into an RDF store first, and then reprocess the data. In terms of RDF stores, that is already a non-trivial amount of data, requiring quite a bit of hardware and tweaking. That is the reason why it would be valuable to load the RDF data directly into Giraph.

cheers, Benjamin.
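
P.S. To make the desired result concrete, here is a minimal sketch in plain Java of the grouping I have in mind. It does not use any Giraph classes: NTriplesGrouping, parseLine and group are names made up for illustration. It assumes the simplified "subject predicate object ." lines from above, with no whitespace inside the subject or predicate, and it keeps everything in memory, so it only illustrates the per-vertex view and would not scale to the full data set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Giraph): groups triple lines by subject. */
final class NTriplesGrouping {

    /**
     * Splits one simplified triple line ("subject predicate object .")
     * into {subject, predicate, object}.
     * Assumption: subject and predicate contain no whitespace.
     */
    static String[] parseLine(String line) {
        String trimmed = line.trim();
        if (!trimmed.endsWith(".")) {
            throw new IllegalArgumentException("Missing terminating dot: " + line);
        }
        // Drop the trailing dot, then split on whitespace into at most three terms.
        String body = trimmed.substring(0, trimmed.length() - 1).trim();
        String[] terms = body.split("\\s+", 3);
        if (terms.length != 3) {
            throw new IllegalArgumentException("Expected three terms: " + line);
        }
        return terms;
    }

    /**
     * Builds a map from vertex id to its outgoing {predicate, target} edges.
     * Every object also gets an entry, so each line can introduce two vertices.
     */
    static Map<String, List<String[]>> group(Iterable<String> lines) {
        Map<String, List<String[]>> adjacency = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] t = parseLine(line);
            // Revisit the subject vertex if it already exists, otherwise create it.
            adjacency.computeIfAbsent(t[0], k -> new ArrayList<>())
                     .add(new String[] {t[1], t[2]});
            // Make sure the object vertex exists, even without outgoing edges.
            adjacency.computeIfAbsent(t[2], k -> new ArrayList<>());
        }
        return adjacency;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "subject1 predicate1 object1 .",
                "subject1 predicate2 object2 .");
        // Print each vertex id with its number of outgoing edges.
        for (Map.Entry<String, List<String[]>> e : group(lines).entrySet()) {
            System.out.println(e.getKey() + " has " + e.getValue().size() + " outgoing edge(s)");
        }
    }
}
```

The map lookup in group() is exactly the "find an already created vertex" step that I do not know how to express inside a VertexReader, since the lines for one subject can be spread across the whole file.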