jena-commits mailing list archives

From build...@apache.org
Subject svn commit: r930551 - in /websites/staging/jena/trunk/content: ./ documentation/hadoop/artifacts.html documentation/hadoop/index.html
Date Wed, 26 Nov 2014 09:56:49 GMT
Author: buildbot
Date: Wed Nov 26 09:56:48 2014
New Revision: 930551

Log:
Staging update by buildbot for jena

Modified:
    websites/staging/jena/trunk/content/   (props changed)
    websites/staging/jena/trunk/content/documentation/hadoop/artifacts.html
    websites/staging/jena/trunk/content/documentation/hadoop/index.html

Propchange: websites/staging/jena/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Nov 26 09:56:48 2014
@@ -1 +1 @@
-1641783
+1641785

Modified: websites/staging/jena/trunk/content/documentation/hadoop/artifacts.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/hadoop/artifacts.html (original)
+++ websites/staging/jena/trunk/content/documentation/hadoop/artifacts.html Wed Nov 26 09:56:48 2014
@@ -19,7 +19,7 @@
     limitations under the License.
 -->
 
-  <title>Apache Jena - Maven Artifacts for Jena RDF Tools for Hadoop</title>
+  <title>Apache Jena - Maven Artifacts for Jena RDF Tools for Apache Hadoop</title>
   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
@@ -145,10 +145,33 @@
 	<div class="row">
 	<div class="col-md-12">
 	<div id="breadcrumbs"></div>
-	<h1 class="title">Maven Artifacts for Jena RDF Tools for Hadoop</h1>
+	<h1 class="title">Maven Artifacts for Jena RDF Tools for Apache Hadoop</h1>
  <p>The Jena RDF Tools for Hadoop libraries are a collection of maven artifacts which can be used individually
 or together as desired.  These are available from the same locations as any other Jena
 artifact, see <a href="/download/maven.html">Using Jena with Maven</a> for more information.</p>
+<h1 id="hadoop-dependencies">Hadoop Dependencies</h1>
+<p>The first thing to note is that although our libraries depend on the relevant Hadoop libraries, these dependencies
+are marked as <code>provided</code> and therefore are not transitive.  This means that you will typically also need to
+declare these basic dependencies as <code>provided</code> in your own POM:</p>
+<div class="codehilite"><pre><!-- Hadoop Dependencies -->
+<!-- Note these will be provided on the Hadoop cluster hence the provided scope -->
+<dependency>
+  <groupId>org.apache.hadoop</groupId>
+  <artifactId>hadoop-common</artifactId>
+  <version>2.5.1</version>
+  <scope>provided</scope>
+</dependency>
+<dependency>
+  <groupId>org.apache.hadoop</groupId>
+  <artifactId>hadoop-mapreduce-client-common</artifactId>
+  <version>2.5.1</version>
+  <scope>provided</scope>
+</dependency>
+</pre></div>
+
+
+<h1 id="jena-rdf-tools-for-apache-hadoop-artifacts">Jena RDF Tools for Apache Hadoop Artifacts</h1>
 <h2 id="common-api">Common API</h2>
 <p>The <code>jena-hadoop-rdf-common</code> artifact provides common classes for enabling RDF on Hadoop.  This is mainly
 composed of relevant <code>Writable</code> implementations for the various supported RDF primitives.</p>
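
For reference, a minimal sketch of how this artifact might be declared in a consuming project's POM is shown below. The org.apache.jena group ID is assumed and the version is a placeholder, so check the Using Jena with Maven page for the version that matches your Jena release:

<dependency>
  <groupId>org.apache.jena</groupId>
  <artifactId>jena-hadoop-rdf-common</artifactId>
  <!-- Placeholder version: use the release listed on the Jena download/maven pages -->
  <version>x.y.z</version>
</dependency>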

Modified: websites/staging/jena/trunk/content/documentation/hadoop/index.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/hadoop/index.html (original)
+++ websites/staging/jena/trunk/content/documentation/hadoop/index.html Wed Nov 26 09:56:48 2014
@@ -169,7 +169,7 @@ underlying plumbing.</p>
 <li><a href="artifacts.html">Maven Artifacts for Jena JDBC</a></li>
 </ul>
 <h2 id="overview">Overview</h2>
-<p>RDF Tools for Apache Hadoop is published as a set of Maven module via its <a href="artifacts.html">maven artifacts</a>.  The source for this libraries
+<p>RDF Tools for Apache Hadoop is published as a set of Maven module via its <a href="artifacts.html">maven artifacts</a>.  The source for these libraries
 may be <a href="/download/index.cgi">downloaded</a> as part of the source distribution.
 These modules are built against the Hadoop 2.x. APIs and no
 backwards compatibility for 1.x is provided.</p>
 <p>The core aim of these libraries is to provide the basic building blocks that allow users to start writing Hadoop applications that
@@ -201,6 +201,150 @@ on what you are trying to do.  Typically
 </pre></div>
 
 
+<p>Our libraries depend on the relevant Hadoop libraries, but since these libraries are provided by the cluster those dependencies
+are marked as <code>provided</code> and thus are not transitive.  This means that you will typically also need to add the following
+additional dependencies:</p>
+<div class="codehilite"><pre><!-- Hadoop Dependencies -->
+<!-- Note these will be provided on the Hadoop cluster hence the provided scope -->
+<dependency>
+  <groupId>org.apache.hadoop</groupId>
+  <artifactId>hadoop-common</artifactId>
+  <version>2.5.1</version>
+  <scope>provided</scope>
+</dependency>
+<dependency>
+  <groupId>org.apache.hadoop</groupId>
+  <artifactId>hadoop-mapreduce-client-common</artifactId>
+  <version>2.5.1</version>
+  <scope>provided</scope>
+</dependency>
+</pre></div>
+
+
+<p>You can then write code to launch a Map/Reduce job that works with RDF.  For example, let us consider an RDF variation of the classic Hadoop
+word count example.  In this example, which we call node count, we do the following:</p>
+<ul>
+<li>Take in some RDF triples</li>
+<li>Split them up into their constituent nodes i.e. the URIs, Blank Nodes &amp; Literals</li>
+<li>Assign an initial count of one to each node</li>
+<li>Group by node and sum up the counts</li>
+<li>Output the nodes and their usage counts</li>
+</ul>
+<p>We will start with our <code>Mapper</code> implementation; as you can see, this simply takes in a triple and splits it into its constituent nodes.  It
+then outputs each node with an initial count of 1:</p>
+<div class="codehilite"><pre>package org.apache.jena.hadoop.rdf.mapreduce.count;
+
+import org.apache.jena.hadoop.rdf.types.NodeWritable;
+import org.apache.jena.hadoop.rdf.types.TripleWritable;
+import com.hp.hpl.jena.graph.Triple;
+
+/**
+ * A mapper for counting node usages within triples designed primarily for use
+ * in conjunction with {@link NodeCountReducer}
+ *
+ * @param <TKey> Key type
+ */
+public class TripleNodeCountMapper<TKey> extends AbstractNodeTupleNodeCountMapper<TKey, Triple, TripleWritable> {
+
+    @Override
+    protected NodeWritable[] getNodes(TripleWritable tuple) {
+        Triple t = tuple.get();
+        return new NodeWritable[] { new NodeWritable(t.getSubject()),
+                                    new NodeWritable(t.getPredicate()),
+                                    new NodeWritable(t.getObject()) };
+    }
+}
+</pre></div>
+
+
+<p>And then our <code>Reducer</code> implementation; this takes in the data grouped by node, sums up the counts, and outputs the node and the final count:</p>
+<div class="codehilite"><pre>package org.apache.jena.hadoop.rdf.mapreduce.count;
+
+import java.io.IOException;
+import java.util.Iterator;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.jena.hadoop.rdf.types.NodeWritable;
+
+/**
+ * A reducer which takes node keys with a sequence of longs representing counts
+ * as the values and sums the counts together into pairs consisting of a node
+ * key and a count value.
+ */
+public class NodeCountReducer extends Reducer<NodeWritable, LongWritable, NodeWritable, LongWritable> {
+
+    @Override
+    protected void reduce(NodeWritable key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
+        long count = 0;
+        Iterator<LongWritable> iter = values.iterator();
+        while (iter.hasNext()) {
+            count += iter.next().get();
+        }
+        context.write(key, new LongWritable(count));
+    }
+}
+</pre></div>
+
+
+<p>Finally, we need to define an actual Hadoop job we can submit to run this.  Here we take advantage of the <a href="io.html">IO</a> library to provide
+us with support for our desired RDF input format:</p>
+<div class="codehilite"><pre>package org.apache.jena.hadoop.rdf.stats;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.jena.hadoop.rdf.io.input.TriplesInputFormat;
+import org.apache.jena.hadoop.rdf.io.output.ntriples.NTriplesNodeOutputFormat;
+import org.apache.jena.hadoop.rdf.mapreduce.count.NodeCountReducer;
+import org.apache.jena.hadoop.rdf.mapreduce.count.TripleNodeCountMapper;
+import org.apache.jena.hadoop.rdf.types.NodeWritable;
+
+public class RdfMapReduceExample {
+
+    public static void main(String[] args) {
+        try {
+            // Get Hadoop configuration
+            Configuration config = new Configuration(true);
+
+            // Create job
+            Job job = Job.getInstance(config);
+            job.setJarByClass(RdfMapReduceExample.class);
+            job.setJobName("RDF Triples Node Usage Count");
+
+            // Map/Reduce classes
+            job.setMapperClass(TripleNodeCountMapper.class);
+            job.setMapOutputKeyClass(NodeWritable.class);
+            job.setMapOutputValueClass(LongWritable.class);
+            job.setReducerClass(NodeCountReducer.class);
+
+            // Input and Output
+            job.setInputFormatClass(TriplesInputFormat.class);
+            job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
+            FileInputFormat.setInputPaths(job, new Path("/example/input/"));
+            FileOutputFormat.setOutputPath(job, new Path("/example/output/"));
+
+            // Launch the job and await completion
+            job.submit();
+            if (job.monitorAndPrintJob()) {
+                // OK
+                System.out.println("Completed");
+            } else {
+                // Failed
+                System.err.println("Failed");
+            }
+        } catch (Throwable e) {
+            e.printStackTrace();
+        }
+    }
+}
+</pre></div>
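
Since the Hadoop dependencies above are marked as provided but the Jena RDF Tools libraries are not present on the cluster, the jar you submit will typically need to bundle those libraries (or ship them via other Hadoop mechanisms such as -libjars). A minimal sketch of one way to build such a jar with the standard maven-shade-plugin follows; this packaging step is an assumption rather than part of the documentation above, and the ServicesResourceTransformer simply merges any META-INF/services entries from the dependencies, which is generally advisable when shading libraries that rely on java.util.ServiceLoader:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Merge META-INF/services entries from all dependencies into the shaded jar -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

The resulting jar can then be submitted to the cluster in the usual way to run the node count job defined above.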
 <h2 id="apis">APIs</h2>
 <p>There are three main libraries each with their own API:</p>
 <ul>


