From: rvesse@apache.org
To: commits@jena.apache.org
Reply-To: dev@jena.apache.org
Subject: svn commit: r1641785 - in /jena/site/trunk/content/documentation/hadoop: artifacts.mdtext index.mdtext
Date: Wed, 26 Nov 2014 09:56:39 -0000

Author: rvesse
Date: Wed Nov 26 09:56:39 2014
New Revision: 1641785

URL: http://svn.apache.org/r1641785
Log:
Further work on RDF Tools for Hadoop documentation

Modified:
    jena/site/trunk/content/documentation/hadoop/artifacts.mdtext
    jena/site/trunk/content/documentation/hadoop/index.mdtext

Modified: jena/site/trunk/content/documentation/hadoop/artifacts.mdtext
URL: http://svn.apache.org/viewvc/jena/site/trunk/content/documentation/hadoop/artifacts.mdtext?rev=1641785&r1=1641784&r2=1641785&view=diff
==============================================================================
--- jena/site/trunk/content/documentation/hadoop/artifacts.mdtext (original)
+++ jena/site/trunk/content/documentation/hadoop/artifacts.mdtext Wed Nov 26 09:56:39 2014
@@ -1,9 +1,33 @@
-Title: Maven Artifacts for Jena RDF Tools for Hadoop
+Title: Maven Artifacts for Jena RDF Tools for Apache Hadoop
 
 The Jena RDF Tools for Hadoop libraries are a collection of Maven artifacts which can be used individually
 or together as desired. These are available from the same locations as any other Jena artifact, see
 [Using Jena with Maven](/download/maven.html) for more information.
 
+# Hadoop Dependencies
+
+The first thing to note is that although our libraries depend on the relevant Hadoop libraries, these dependencies
+are marked as `provided` and are therefore not transitive. This means that you will typically also need to
+declare these basic dependencies as `provided` in your own POM:
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-common</artifactId>
+      <version>2.5.1</version>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-mapreduce-client-common</artifactId>
+      <version>2.5.1</version>
+      <scope>provided</scope>
+    </dependency>
+
+# Jena RDF Tools for Apache Hadoop Artifacts
+
 ## Common API
 
 The `jena-hadoop-rdf-common` artifact provides common classes for enabling RDF on Hadoop.
 This is mainly

Modified: jena/site/trunk/content/documentation/hadoop/index.mdtext
URL: http://svn.apache.org/viewvc/jena/site/trunk/content/documentation/hadoop/index.mdtext?rev=1641785&r1=1641784&r2=1641785&view=diff
==============================================================================
--- jena/site/trunk/content/documentation/hadoop/index.mdtext (original)
+++ jena/site/trunk/content/documentation/hadoop/index.mdtext Wed Nov 26 09:56:39 2014
@@ -22,7 +22,7 @@ underlying plumbing.
 
 ## Overview
 
-RDF Tools for Apache Hadoop is published as a set of Maven module via its [maven artifacts](artifacts.html). The source for this libraries
+RDF Tools for Apache Hadoop is published as a set of Maven modules via its [maven artifacts](artifacts.html). The source for these libraries
 may be [downloaded](/download/index.cgi) as part of the source distribution. These modules are built against the Hadoop 2.x APIs and no
 backwards compatibility for 1.x is provided.
 
@@ -60,6 +60,147 @@ on what you are trying to do. Typically
         x.y.z
 
+Our libraries depend on the relevant Hadoop libraries, but since those libraries are provided by the cluster the dependencies are marked
+as `provided` and thus are not transitive. This means that you will typically also need to add the following additional dependencies:
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-common</artifactId>
+      <version>2.5.1</version>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-mapreduce-client-common</artifactId>
+      <version>2.5.1</version>
+      <scope>provided</scope>
+    </dependency>
+
+You can then write code to launch a Map/Reduce job that works with RDF. For example let us consider an RDF variation of the classic Hadoop
+word count example. In this example, which we call node count, we do the following:
+
+- Take in some RDF triples
+- Split them up into their constituent nodes, i.e. the URIs, Blank Nodes & Literals
+- Assign an initial count of one to each node
+- Group by node and sum up the counts
+- Output the nodes and their usage counts
+
+We will start with our `Mapper` implementation. As you can see, this simply takes in a triple and splits it into its constituent nodes. It
+then outputs each node with an initial count of 1:
+
+    package org.apache.jena.hadoop.rdf.mapreduce.count;
+
+    import org.apache.jena.hadoop.rdf.types.NodeWritable;
+    import org.apache.jena.hadoop.rdf.types.TripleWritable;
+    import com.hp.hpl.jena.graph.Triple;
+
+    /**
+     * A mapper for counting node usages within triples designed primarily for use
+     * in conjunction with {@link NodeCountReducer}
+     *
+     * @param <TKey> Key type
+     */
+    public class TripleNodeCountMapper<TKey> extends AbstractNodeTupleNodeCountMapper<TKey, Triple, TripleWritable> {
+
+        @Override
+        protected NodeWritable[] getNodes(TripleWritable tuple) {
+            Triple t = tuple.get();
+            return new NodeWritable[] { new NodeWritable(t.getSubject()),
+                                        new NodeWritable(t.getPredicate()),
+                                        new NodeWritable(t.getObject()) };
+        }
+    }
+
+And then our `Reducer` implementation. This takes in the data grouped by node and sums up the counts, outputting each node and its final count:
+
+    package org.apache.jena.hadoop.rdf.mapreduce.count;
+
+    import java.io.IOException;
+    import java.util.Iterator;
+    import org.apache.hadoop.io.LongWritable;
+    import org.apache.hadoop.mapreduce.Reducer;
+    import org.apache.jena.hadoop.rdf.types.NodeWritable;
+
+    /**
+     * A reducer which takes node keys with a sequence of longs representing counts
+     * as the values and sums the counts together into pairs consisting of a node
+     * key and a count value.
+     */
+    public class NodeCountReducer extends Reducer<NodeWritable, LongWritable, NodeWritable, LongWritable> {
+
+        @Override
+        protected void reduce(NodeWritable key, Iterable<LongWritable> values, Context context) throws IOException,
+                InterruptedException {
+            long count = 0;
+            Iterator<LongWritable> iter = values.iterator();
+            while (iter.hasNext()) {
+                count += iter.next().get();
+            }
+            context.write(key, new LongWritable(count));
+        }
+    }
+
+Finally we need to define an actual Hadoop job that we can submit to run this. Here we take advantage of the [IO](io.html) library to provide
+us with support for our desired RDF input format:
+
+    package org.apache.jena.hadoop.rdf.stats;
+
+    import org.apache.hadoop.conf.Configuration;
+    import org.apache.hadoop.fs.Path;
+    import org.apache.hadoop.io.LongWritable;
+    import org.apache.hadoop.mapreduce.Job;
+    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+    import org.apache.jena.hadoop.rdf.io.input.TriplesInputFormat;
+    import org.apache.jena.hadoop.rdf.io.output.ntriples.NTriplesNodeOutputFormat;
+    import org.apache.jena.hadoop.rdf.mapreduce.count.NodeCountReducer;
+    import org.apache.jena.hadoop.rdf.mapreduce.count.TripleNodeCountMapper;
+    import org.apache.jena.hadoop.rdf.types.NodeWritable;
+
+    public class RdfMapReduceExample {
+
+        public static void main(String[] args) {
+            try {
+                // Get Hadoop configuration
+                Configuration config = new Configuration(true);
+
+                // Create job
+                Job job = Job.getInstance(config);
+                job.setJarByClass(RdfMapReduceExample.class);
+                job.setJobName("RDF Triples Node Usage Count");
+
+                // Map/Reduce classes
+                job.setMapperClass(TripleNodeCountMapper.class);
+                job.setMapOutputKeyClass(NodeWritable.class);
+                job.setMapOutputValueClass(LongWritable.class);
+                job.setReducerClass(NodeCountReducer.class);
+
+                // Input and Output
+                job.setInputFormatClass(TriplesInputFormat.class);
+                job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
+                FileInputFormat.setInputPaths(job, new Path("/example/input/"));
+                FileOutputFormat.setOutputPath(job, new Path("/example/output/"));
+
+                // Launch the job and await completion
+                job.submit();
+                if (job.monitorAndPrintJob()) {
+                    // OK
+                    System.out.println("Completed");
+                } else {
+                    // Failed
+                    System.err.println("Failed");
+                }
+            } catch (Throwable e) {
+                e.printStackTrace();
+            }
+        }
+    }
+
 ## APIs
 
 There are three main libraries each with their own API:
 
@@ -70,3 +211,4 @@ There are three main libraries each with
+
\ No newline at end of file
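
Not part of the patch above, but a possible follow-on for the example it adds: the same job can be expressed through Hadoop's standard `Tool`/`ToolRunner` pattern so that the input and output paths (and generic `-D` options) come from the command line rather than being hard coded, and because `NodeCountReducer` simply sums `LongWritable` values with identical input and output types it can also be registered as a combiner to pre-aggregate counts on the map side. The sketch below reuses only the classes already shown; the class name `RdfNodeCountTool` and the argument handling are illustrative:

    package org.apache.jena.hadoop.rdf.stats;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.jena.hadoop.rdf.io.input.TriplesInputFormat;
    import org.apache.jena.hadoop.rdf.io.output.ntriples.NTriplesNodeOutputFormat;
    import org.apache.jena.hadoop.rdf.mapreduce.count.NodeCountReducer;
    import org.apache.jena.hadoop.rdf.mapreduce.count.TripleNodeCountMapper;
    import org.apache.jena.hadoop.rdf.types.NodeWritable;

    // Illustrative Tool-based variant of the RdfMapReduceExample shown in the patch
    public class RdfNodeCountTool extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // Expect the input path and output path as the two remaining arguments
            Job job = Job.getInstance(getConf());
            job.setJarByClass(RdfNodeCountTool.class);
            job.setJobName("RDF Triples Node Usage Count");

            // Same Map/Reduce classes as before; the reducer doubles as a combiner
            // because it just sums longs and its input and output types match
            job.setMapperClass(TripleNodeCountMapper.class);
            job.setMapOutputKeyClass(NodeWritable.class);
            job.setMapOutputValueClass(LongWritable.class);
            job.setCombinerClass(NodeCountReducer.class);
            job.setReducerClass(NodeCountReducer.class);

            // RDF-aware input and output formats from the IO library
            job.setInputFormatClass(TriplesInputFormat.class);
            job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Block until the job finishes, printing progress as it runs
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner parses the generic Hadoop options before handing over to run()
            System.exit(ToolRunner.run(new Configuration(), new RdfNodeCountTool(), args));
        }
    }

Packaged into an application JAR, such a tool would typically be launched with something like
`hadoop jar my-app.jar org.apache.jena.hadoop.rdf.stats.RdfNodeCountTool /example/input/ /example/output/`,
where the JAR name and paths are again only illustrative.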