From: rvesse@apache.org
To: commits@jena.apache.org
Reply-To: dev@jena.apache.org
Subject: svn commit: r1641785 - in /jena/site/trunk/content/documentation/hadoop: artifacts.mdtext index.mdtext
Date: Wed, 26 Nov 2014 09:56:39 -0000

Author: rvesse
Date: Wed Nov 26 09:56:39 2014
New Revision: 1641785

URL: http://svn.apache.org/r1641785
Log:
Further work on RDF Tools for Hadoop documentation

Modified:
    jena/site/trunk/content/documentation/hadoop/artifacts.mdtext
    jena/site/trunk/content/documentation/hadoop/index.mdtext

Modified: jena/site/trunk/content/documentation/hadoop/artifacts.mdtext
URL: http://svn.apache.org/viewvc/jena/site/trunk/content/documentation/hadoop/artifacts.mdtext?rev=1641785&r1=1641784&r2=1641785&view=diff
==============================================================================
--- jena/site/trunk/content/documentation/hadoop/artifacts.mdtext (original)
+++ jena/site/trunk/content/documentation/hadoop/artifacts.mdtext Wed Nov 26 09:56:39 2014
@@ -1,9 +1,33 @@
-Title: Maven Artifacts for Jena RDF Tools for Hadoop
+Title: Maven Artifacts for Jena RDF Tools for Apache Hadoop
 
 The Jena RDF Tools for Hadoop libraries are a collection of Maven artifacts which can be used individually
 or together as desired. These are available from the same locations as any other Jena artifact, see
 [Using Jena with Maven](/download/maven.html) for more information.
 
+# Hadoop Dependencies
+
+The first thing to note is that although our libraries depend on the relevant Hadoop libraries, these dependencies
+are marked as `provided` and are therefore not transitive. This means that you will typically also need to
+declare these basic dependencies as `provided` in your own POM:
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-common</artifactId>
+      <version>2.5.1</version>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-mapreduce-client-common</artifactId>
+      <version>2.5.1</version>
+      <scope>provided</scope>
+    </dependency>
+
+# Jena RDF Tools for Apache Hadoop Artifacts
+
 ## Common API
 
 The `jena-hadoop-rdf-common` artifact provides common classes for enabling RDF on Hadoop.
 This is mainly

Modified: jena/site/trunk/content/documentation/hadoop/index.mdtext
URL: http://svn.apache.org/viewvc/jena/site/trunk/content/documentation/hadoop/index.mdtext?rev=1641785&r1=1641784&r2=1641785&view=diff
==============================================================================
--- jena/site/trunk/content/documentation/hadoop/index.mdtext (original)
+++ jena/site/trunk/content/documentation/hadoop/index.mdtext Wed Nov 26 09:56:39 2014
@@ -22,7 +22,7 @@ underlying plumbing.
 
 ## Overview
 
-RDF Tools for Apache Hadoop is published as a set of Maven module via its [maven artifacts](artifacts.html). The source for this libraries
+RDF Tools for Apache Hadoop is published as a set of Maven modules via its [maven artifacts](artifacts.html). The source for these libraries
 may be [downloaded](/download/index.cgi) as part of the source distribution. These modules are built against the Hadoop 2.x APIs and no
 backwards compatibility for 1.x is provided.
 
@@ -60,6 +60,147 @@ on what you are trying to do. Typically
         x.y.z
 
+Our libraries depend on the relevant Hadoop libraries, but since those libraries are provided by the cluster the dependencies are marked
+as `provided` and thus are not transitive. This means that you will typically also need to add the following additional dependencies:
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-common</artifactId>
+      <version>2.5.1</version>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-mapreduce-client-common</artifactId>
+      <version>2.5.1</version>
+      <scope>provided</scope>
+    </dependency>
+
+You can then write code to launch a Map/Reduce job that works with RDF. For example let us consider an RDF variation of the classic Hadoop
+word count example. In this example, which we call node count, we do the following:
+
+- Take in some RDF triples
+- Split them up into their constituent nodes, i.e. the URIs, Blank Nodes & Literals
+- Assign an initial count of one to each node
+- Group by node and sum up the counts
+- Output the nodes and their usage counts
+
+We will start with our `Mapper` implementation. As you can see, this simply takes in a triple and splits it into its constituent nodes. It
+then outputs each node with an initial count of 1:
+
+    package org.apache.jena.hadoop.rdf.mapreduce.count;
+
+    import org.apache.jena.hadoop.rdf.types.NodeWritable;
+    import org.apache.jena.hadoop.rdf.types.TripleWritable;
+    import com.hp.hpl.jena.graph.Triple;
+
+    /**
+     * A mapper for counting node usages within triples designed primarily for use
+     * in conjunction with {@link NodeCountReducer}
+     *
+     * @param <TKey> Key type
+     */
+    public class TripleNodeCountMapper<TKey> extends AbstractNodeTupleNodeCountMapper<TKey, Triple, TripleWritable> {
+
+        @Override
+        protected NodeWritable[] getNodes(TripleWritable tuple) {
+            Triple t = tuple.get();
+            return new NodeWritable[] { new NodeWritable(t.getSubject()),
+                                        new NodeWritable(t.getPredicate()),
+                                        new NodeWritable(t.getObject()) };
+        }
+    }
+
+And then our `Reducer` implementation. This takes in the data grouped by node and sums up the counts, outputting each node and its final count:
+
+    package org.apache.jena.hadoop.rdf.mapreduce.count;
+
+    import java.io.IOException;
+    import java.util.Iterator;
+    import org.apache.hadoop.io.LongWritable;
+    import org.apache.hadoop.mapreduce.Reducer;
+    import org.apache.jena.hadoop.rdf.types.NodeWritable;
+
+    /**
+     * A reducer which takes node keys with a sequence of longs representing counts
+     * as the values and sums the counts together into pairs consisting of a node
+     * key and a count value.
+     */
+    public class NodeCountReducer extends Reducer<NodeWritable, LongWritable, NodeWritable, LongWritable> {
+
+        @Override
+        protected void reduce(NodeWritable key, Iterable<LongWritable> values, Context context) throws IOException,
+                InterruptedException {
+            long count = 0;
+            Iterator<LongWritable> iter = values.iterator();
+            while (iter.hasNext()) {
+                count += iter.next().get();
+            }
+            context.write(key, new LongWritable(count));
+        }
+    }
+
+Finally we need to define an actual Hadoop job that we can submit to run this. Here we take advantage of the [IO](io.html) library to provide
+us with support for our desired RDF input format:
+
+    package org.apache.jena.hadoop.rdf.stats;
+
+    import org.apache.hadoop.conf.Configuration;
+    import org.apache.hadoop.fs.Path;
+    import org.apache.hadoop.io.LongWritable;
+    import org.apache.hadoop.mapreduce.Job;
+    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+    import org.apache.jena.hadoop.rdf.io.input.TriplesInputFormat;
+    import org.apache.jena.hadoop.rdf.io.output.ntriples.NTriplesNodeOutputFormat;
+    import org.apache.jena.hadoop.rdf.mapreduce.count.NodeCountReducer;
+    import org.apache.jena.hadoop.rdf.mapreduce.count.TripleNodeCountMapper;
+    import org.apache.jena.hadoop.rdf.types.NodeWritable;
+
+    public class RdfMapReduceExample {
+
+        public static void main(String[] args) {
+            try {
+                // Get Hadoop configuration
+                Configuration config = new Configuration(true);
+
+                // Create job
+                Job job = Job.getInstance(config);
+                job.setJarByClass(RdfMapReduceExample.class);
+                job.setJobName("RDF Triples Node Usage Count");
+
+                // Map/Reduce classes
+                job.setMapperClass(TripleNodeCountMapper.class);
+                job.setMapOutputKeyClass(NodeWritable.class);
+                job.setMapOutputValueClass(LongWritable.class);
+                job.setReducerClass(NodeCountReducer.class);
+
+                // Input and Output
+                job.setInputFormatClass(TriplesInputFormat.class);
+                job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
+                FileInputFormat.setInputPaths(job, new Path("/example/input/"));
+                FileOutputFormat.setOutputPath(job, new Path("/example/output/"));
+
+                // Launch the job and await completion
+                job.submit();
+                if (job.monitorAndPrintJob()) {
+                    // OK
+                    System.out.println("Completed");
+                } else {
+                    // Failed
+                    System.err.println("Failed");
+                }
+            } catch (Throwable e) {
+                e.printStackTrace();
+            }
+        }
+    }
+
 ## APIs
 
 There are three main libraries each with their own API:
 
@@ -70,3 +211,4 @@ There are three main libraries each with
+
\ No newline at end of file
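
Not part of the patch above, but a possible follow-on for the example it adds: the same job can be expressed through Hadoop's standard `Tool`/`ToolRunner` pattern so that the input and output paths (and generic `-D` options) come from the command line rather than being hard coded, and because `NodeCountReducer` simply sums `LongWritable` values with identical input and output types it can also be registered as a combiner to pre-aggregate counts on the map side. The sketch below reuses only the classes already shown; the class name `RdfNodeCountTool` and the argument handling are illustrative:

    package org.apache.jena.hadoop.rdf.stats;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.jena.hadoop.rdf.io.input.TriplesInputFormat;
    import org.apache.jena.hadoop.rdf.io.output.ntriples.NTriplesNodeOutputFormat;
    import org.apache.jena.hadoop.rdf.mapreduce.count.NodeCountReducer;
    import org.apache.jena.hadoop.rdf.mapreduce.count.TripleNodeCountMapper;
    import org.apache.jena.hadoop.rdf.types.NodeWritable;

    // Illustrative Tool-based variant of the RdfMapReduceExample shown in the patch
    public class RdfNodeCountTool extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // Expect the input path and output path as the two remaining arguments
            Job job = Job.getInstance(getConf());
            job.setJarByClass(RdfNodeCountTool.class);
            job.setJobName("RDF Triples Node Usage Count");

            // Same Map/Reduce classes as before; the reducer doubles as a combiner
            // because it just sums longs and its input and output types match
            job.setMapperClass(TripleNodeCountMapper.class);
            job.setMapOutputKeyClass(NodeWritable.class);
            job.setMapOutputValueClass(LongWritable.class);
            job.setCombinerClass(NodeCountReducer.class);
            job.setReducerClass(NodeCountReducer.class);

            // RDF-aware input and output formats from the IO library
            job.setInputFormatClass(TriplesInputFormat.class);
            job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Block until the job finishes, printing progress as it runs
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner parses the generic Hadoop options before handing over to run()
            System.exit(ToolRunner.run(new Configuration(), new RdfNodeCountTool(), args));
        }
    }

Packaged into an application JAR, such a tool would typically be launched with something like
`hadoop jar my-app.jar org.apache.jena.hadoop.rdf.stats.RdfNodeCountTool /example/input/ /example/output/`,
where the JAR name and paths are again only illustrative.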