jena-commits mailing list archives

From build...@apache.org
Subject svn commit: r940457 - in /websites/staging/jena/trunk/content: ./ documentation/hadoop/demo.md documentation/hadoop/mapred.html
Date Tue, 17 Feb 2015 12:22:03 GMT
Author: buildbot
Date: Tue Feb 17 12:22:03 2015
New Revision: 940457

Log:
Staging update by buildbot for jena

Added:
    websites/staging/jena/trunk/content/documentation/hadoop/demo.md
Modified:
    websites/staging/jena/trunk/content/   (props changed)
    websites/staging/jena/trunk/content/documentation/hadoop/mapred.html

Propchange: websites/staging/jena/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Tue Feb 17 12:22:03 2015
@@ -1 +1 @@
-1660090
+1660361

Added: websites/staging/jena/trunk/content/documentation/hadoop/demo.md
==============================================================================
--- websites/staging/jena/trunk/content/documentation/hadoop/demo.md (added)
+++ websites/staging/jena/trunk/content/documentation/hadoop/demo.md Tue Feb 17 12:22:03 2015
@@ -0,0 +1,106 @@
+Title: Apache Jena Elephas - RDF Stats Demo
+
+The RDF Stats Demo is a pre-built application available as a ready-to-run Hadoop Job JAR with all dependencies embedded within it.  The demo app uses the other Elephas libraries to calculate a number of basic statistics over any RDF data supported by Elephas.
+
+To use it, you will first need to build it from source or download the relevant Maven artefact:
+
+    <dependency>
+      <groupId>org.apache.jena</groupId>
+      <artifactId>jena-elephas-stats</artifactId>
+      <version>x.y.z</version>
+      <classifier>hadoop-job</classifier>
+    </dependency>
+    
+Where `x.y.z` is the desired version.
+
+# Pre-requisites
+
+In order to run this demo you will need a Hadoop 2.x cluster available; for simple experimentation purposes a [single node cluster](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html) will be sufficient.
+
+# Running
+
+Assuming your cluster is started and running and the `hadoop` command is available on your path, you can run the application without any arguments to see help:
+
+    > hadoop jar jena-elephas-stats-VERSION-hadoop-job.jar org.apache.jena.hadoop.rdf.stats.RdfStats
+    NAME
+        hadoop jar PATH_TO_JAR org.apache.jena.hadoop.rdf.stats.RdfStats - A
+        command which computes statistics on RDF data using Hadoop
+
+    SYNOPSIS
+        hadoop jar PATH_TO_JAR org.apache.jena.hadoop.rdf.stats.RdfStats
+                [ {-a | --all} ] [ {-d | --data-types} ] [ {-g | --graph-sizes} ]
+                [ {-h | --help} ] [ --input-type <inputType> ] [ {-n | --node-count} ]
+                [ --namespaces ] {-o | --output} <OutputPath> [ {-t | --type-count} ]
+                [--] <InputPath>...
+
+    OPTIONS
+        -a, --all
+            Requests that all available statistics be calculated
+
+        -d, --data-types
+            Requests that literal data type usage counts be calculated
+
+        -g, --graph-sizes
+            Requests that the size of each named graph be counted
+
+        -h, --help
+            Display help information
+
+        --input-type <inputType>
+            Specifies whether the input data is a mixture of quads and triples,
+            just quads or just triples. Using the most specific data type will
+            yield the most accurate statistics
+
+            This options value is restricted to the following value(s):
+                mixed
+                quads
+                triples
+
+        -n, --node-count
+            Requests that node usage counts be calculated
+
+        --namespaces
+            Requests that namespace usage counts be calculated
+
+        -o <OutputPath>, --output <OutputPath>
+            Sets the output path
+
+        -t, --type-count
+            Requests that rdf:type usage counts be calculated
+
+        --
+            This option can be used to separate command-line options from the
+            list of argument, (useful when arguments might be mistaken for
+            command-line options)
+
+        <InputPath>
+            Sets the input path(s)
+
+If we wanted to calculate the node count on some data, we could do the following:
+
+    > hadoop jar jena-elephas-stats-VERSION-hadoop-job.jar org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output /example/output /example/input
+
+This calculates the node counts for the input data found in `/example/input`, placing the generated counts in `/example/output`.
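To make the statistic concrete, here is a plain-Java sketch of what the node count computes. This is illustrative only and not the Elephas implementation, which runs as a distributed Map/Reduce job: each triple is treated as a subject/predicate/object array and every occurrence of each term is tallied.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: counts how often each RDF term (node in Jena
// parlance) occurs across all positions of a set of triples. Elephas does
// the equivalent with a Mapper emitting (node, 1) pairs and a reducer
// summing the counts per node.
public class NodeCountSketch {

    // Each triple is a String[3] of subject, predicate, object
    public static Map<String, Long> countNodes(List<String[]> triples) {
        Map<String, Long> counts = new HashMap<>();
        for (String[] triple : triples) {
            for (String node : triple) {
                counts.merge(node, 1L, Long::sum); // like the reducer summing per-node 1s
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> triples = new ArrayList<>();
        triples.add(new String[] { "http://example.org/a", "http://example.org/p", "http://example.org/b" });
        triples.add(new String[] { "http://example.org/a", "http://example.org/p", "http://example.org/c" });
        System.out.println(countNodes(triples));
    }
}
```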
+
+## Specifying Inputs and Outputs
+
+Inputs are specified simply by providing one or more paths to the data you wish to analyse.  You can provide directory paths, in which case all files within the directory will be processed.
+
+To specify the output location, use the `-o` or `--output` option followed by the desired output path.
+
+By default the demo application assumes a mixture of quads and triples data.  If you know your data contains only triples or only quads then you can use the `--input-type` argument followed by `triples` or `quads` to indicate the type of your data.  Not doing this can skew some statistics because the default is to assume mixed data, so all triples are upgraded into quads when calculating the statistics.
+    
+## Available Statistics
+
+The following statistics are available and are activated by the relevant command line option:
+
+<table>
+  <tr><th>Command Line Option</th><th>Statistic</th><th>Description & Notes</th></tr>
+  <tr><td>`-n` or `--node-count`</td><td>Node Count</td><td>Counts the occurrences of each unique RDF term, i.e. node in Jena parlance</td></tr>
+  <tr><td>`-t` or `--type-count`</td><td>Type Count</td><td>Counts the occurrences of each declared `rdf:type` value</td></tr>
+  <tr><td>`-d` or `--data-types`</td><td>Data Type Count</td><td>Counts the occurrences of each declared literal data type</td></tr>
+  <tr><td>`--namespaces`</td><td>Namespace Counts</td><td>Counts the occurrences of namespaces within the data.<br />Namespaces are determined by splitting URIs at the `#` fragment separator if present, and otherwise at the last `/` character</td></tr>
+  <tr><td>`-g` or `--graph-sizes`</td><td>Graph Sizes</td><td>Counts the sizes of each graph declared in the data</td></tr>
+</table>
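The namespace-splitting rule described above can be sketched in plain Java.  This mirrors the documented behaviour (split at `#` if present, otherwise at the last `/`, otherwise treat the whole URI as the namespace) but is an illustration rather than the Elephas code; in particular, keeping the separator character as part of the namespace is an assumption here.

```java
// Illustrative sketch of the documented namespace rule, not the Elephas
// implementation. Assumption: the separator character stays part of the
// namespace string.
public class NamespaceSketch {
    public static String namespaceOf(String uri) {
        int hash = uri.indexOf('#');
        if (hash >= 0) return uri.substring(0, hash + 1);   // split at fragment separator
        int slash = uri.lastIndexOf('/');
        if (slash >= 0) return uri.substring(0, slash + 1); // otherwise at the last '/'
        return uri; // no separator: the full URI counts as the namespace
    }

    public static void main(String[] args) {
        System.out.println(namespaceOf("http://example.org/ns#Thing")); // http://example.org/ns#
        System.out.println(namespaceOf("http://example.org/ns/Thing")); // http://example.org/ns/
    }
}
```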
+
+You can also use the `-a` or `--all` option if you simply wish to calculate all statistics.
\ No newline at end of file

Modified: websites/staging/jena/trunk/content/documentation/hadoop/mapred.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/hadoop/mapred.html (original)
+++ websites/staging/jena/trunk/content/documentation/hadoop/mapred.html Tue Feb 17 12:22:03
2015
@@ -163,6 +163,10 @@
 <li><a href="#transforming">Transforming</a></li>
 </ul>
 </li>
+<li><a href="#example-jobs">Example Jobs</a><ul>
+<li><a href="#node-count">Node Count</a></li>
+</ul>
+</li>
 </ul>
 </div>
 <h1 id="tasks">Tasks</h1>
@@ -175,6 +179,7 @@
 <li>Splitting</li>
 <li>Transforming</li>
 </ul>
+<p>Note that standard Map/Reduce programming rules apply as normal.  For example, if a mapper/reducer transforms between data types then you need to make <code>setMapOutputKeyClass()</code>, <code>setMapOutputValueClass()</code>, <code>setOutputKeyClass()</code> and <code>setOutputValueClass()</code> calls on your Job configuration as necessary.</p>
 <h2 id="counting">Counting</h2>
 <p>Counting is one of the classic Map/Reduce tasks and features as the official Map/Reduce example for both Hadoop itself and Elephas.  Implementations cover a number of different counting tasks that you might want to carry out upon RDF data; in most cases you will use the desired <code>Mapper</code> implementation in conjunction with the <code>NodeCountReducer</code>.</p>
 <h3 id="node-usage">Node Usage</h3>
@@ -185,7 +190,11 @@
 <h3 id="namespaces">Namespaces</h3>
 <p>Finally you may be interested in the usage of namespaces within your data; in this case the <code>TripleNamespaceCountMapper</code> or <code>QuadNamespaceCountMapper</code> can be used.  For this use case you should use the <code>TextCountReducer</code> to total up the counts for each namespace.  Note that the mappers determine the namespace for a URI simply by splitting after the last <code>#</code> or <code>/</code> in the URI; if no such character exists then the full URI is considered to be the namespace.</p>
 <h2 id="filtering">Filtering</h2>
-<p>Filtering is another classic Map/Reduce use case, here you want to take the data
and extract only the portions that you are interested in based on some criteria.  All our
filter <code>Mapper</code> implementations also support a Job configuration option
named <code>rdf.mapreduce.filter.invert</code> allowing their effects to be inverted
if desired.</p>
+<p>Filtering is another classic Map/Reduce use case: here you want to take the data and extract only the portions that you are interested in based on some criteria.  All our filter <code>Mapper</code> implementations also support a Job configuration option named <code>rdf.mapreduce.filter.invert</code> allowing their effects to be inverted if desired, e.g.</p>
+<div class="codehilite"><pre><span class="n">config</span><span
class="p">.</span><span class="n">setProperty</span><span class="p">(</span><span
class="n">RdfMapReduceConstants</span><span class="p">.</span><span
class="n">FILTER_INVERT</span><span class="p">,</span> <span class="n">true</span><span
class="p">);</span>
+</pre></div>
+
+
 <h3 id="valid-data">Valid Data</h3>
 <p>One type of filter that may be useful particularly if you are generating RDF data
that may not be strict RDF is the <code>ValidTripleFilterMapper</code> and the
<code>ValidQuadFilterMapper</code>.  These filters only keep triples/quads that
are valid according to strict RDF semantics i.e.</p>
 <ul>
@@ -194,12 +203,20 @@
 <li>Object can be a URI/Blank Node/Literal</li>
 <li>Graph can only be a URI or Blank Node</li>
 </ul>
-<p>If you wanted to extract only the bad data e.g. for debugging then you can of course
invert these filters by setting <code>rdf.mapreduce.filter.invert</code> to <code>true</code>.</p>
+<p>If you wanted to extract only the bad data, e.g. for debugging, then you can of course invert these filters by setting <code>rdf.mapreduce.filter.invert</code> to <code>true</code> as shown above.</p>
 <h3 id="ground-data">Ground Data</h3>
 <p>In some cases you may only be interested in triples/quads that are grounded, i.e. don't contain blank nodes, in which case the <code>GroundTripleFilterMapper</code> and <code>GroundQuadFilterMapper</code> can be used.</p>
 <h3 id="data-with-a-specific-uri">Data with a specific URI</h3>
-<p>In lots of case you may want to extract only data where a specific URI occurs in
a specific position, for example if you wanted to extract all the <code>rdf:type</code>
declarations then you might want to use the <code>TripleFilterByPredicateUriMapper</code>
or <code>QuadFilterByPredicateUriMapper</code> as appropriate.  The job configuration
option <code>rdf.mapreduce.filter.predicate.uris</code> is used to provide a comma
separated list of the full URIs you want the filter to accept.</p>
-<p>Similar to the counting of node usage you can substitute <code>Predicate</code>
for <code>Subject</code>, <code>Object</code> or <code>Graph</code>
as desired.  You will also need to do this in the job configuration option, for example to
filter on subject URIs in quads use the <code>QuadFilterBySubjectUriMapper</code>
and the <code>rdf.mapreduce.filter.subject.uris</code> configuration option.</p>
+<p>In lots of cases you may want to extract only data where a specific URI occurs in a specific position; for example, if you wanted to extract all the <code>rdf:type</code> declarations then you might want to use the <code>TripleFilterByPredicateUriMapper</code> or <code>QuadFilterByPredicateUriMapper</code> as appropriate.  The job configuration option <code>rdf.mapreduce.filter.predicate.uris</code> is used to provide a comma separated list of the full URIs you want the filter to accept, e.g.</p>
+<div class="codehilite"><pre><span class="n">config</span><span
class="p">.</span><span class="n">setProperty</span><span class="p">(</span><span
class="n">RdfMapReduceConstants</span><span class="p">.</span><span
class="n">FILTER_PREDICATE_URIS</span><span class="p">,</span> &quot;<span
class="n">http</span><span class="p">:</span><span class="o">//</span><span
class="n">example</span><span class="p">.</span><span class="n">org</span><span
class="o">/</span><span class="n">predicate</span><span class="p">,</span><span
class="n">http</span><span class="p">:</span><span class="o">//</span><span
class="n">another</span><span class="p">.</span><span class="n">org</span><span
class="o">/</span><span class="n">predicate</span>&quot;<span
class="p">);</span>
+</pre></div>
+
+
+<p>Similar to the counting of node usage you can substitute <code>Predicate</code>
for <code>Subject</code>, <code>Object</code> or <code>Graph</code>
as desired.  You will also need to do this in the job configuration option, for example to
filter on subject URIs in quads use the <code>QuadFilterBySubjectUriMapper</code>
and the <code>rdf.mapreduce.filter.subject.uris</code> configuration option e.g.</p>
+<div class="codehilite"><pre><span class="n">config</span><span
class="p">.</span><span class="n">setProperty</span><span class="p">(</span><span
class="n">RdfMapReduceConstants</span><span class="p">.</span><span
class="n">FILTER_SUBJECT_URIS</span><span class="p">,</span> &quot;<span
class="n">http</span><span class="p">:</span><span class="o">//</span><span
class="n">example</span><span class="p">.</span><span class="n">org</span><span
class="o">/</span><span class="n">myInstance</span>&quot;<span
class="p">);</span>
+</pre></div>
+
+
 <h2 id="grouping">Grouping</h2>
 <p>Grouping is again another frequent Map/Reduce use case; here we provide implementations that allow you to group triples or quads by a specific RDF node within the triples/quads, e.g. by subject.  For example, to group quads by predicate use the <code>QuadGroupByPredicateMapper</code>; similar to filtering and counting, you can substitute <code>Predicate</code> for <code>Subject</code>, <code>Object</code> or <code>Graph</code> if you wish to group by another node of the triple/quad.</p>
 <h2 id="splitting">Splitting</h2>
@@ -211,6 +228,45 @@
 <h2 id="transforming">Transforming</h2>
 <p>Transforming provides some very simple implementations that allow you to convert
between triples and quads.  For the lossy case of going from quads to triples simply use the
<code>QuadsToTriplesMapper</code>.</p>
 <p>If you want to go the other way - triples to quads - this requires adding a graph field to each triple and we provide two implementations that do that.  Firstly there is <code>TriplesToQuadsBySubjectMapper</code> which puts each triple into a graph based on its subject, i.e. all triples with a common subject go into a graph named for the subject.  Secondly there is <code>TriplesToQuadsConstantGraphMapper</code> which simply puts all triples into the default graph; if you wish to change the target graph you should extend this class.  If you wanted to select the graph to use based on some arbitrary criteria you should look at extending the <code>AbstractTriplesToQuadsMapper</code> instead.</p>
+<h1 id="example-jobs">Example Jobs</h1>
+<h2 id="node-count">Node Count</h2>
+<p>The following example shows how to configure a job which performs a node count, i.e. counts the usages of RDF terms (aka nodes in Jena parlance) within the data:</p>
+<div class="codehilite"><pre><span class="c1">// Assumes we have already
created a Hadoop Configuration </span>
+<span class="c1">// and stored it in the variable config</span>
+<span class="n">Job</span> <span class="n">job</span> <span class="o">=</span>
<span class="n">Job</span><span class="p">.</span><span class="n">getInstance</span><span
class="p">(</span><span class="k">config</span><span class="p">);</span>
+
+<span class="c1">// This is necessary as otherwise Hadoop won&#39;t ship the JAR
to all</span>
+<span class="c1">// nodes and you&#39;ll get ClassDefNotFound and similar errors</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setJarByClass</span><span
class="p">(</span><span class="n">Example</span><span class="p">.</span><span
class="k">class</span><span class="p">);</span>
+
+<span class="c1">// Give our job a friendly name</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setJobName</span><span
class="p">(</span><span class="s">&quot;RDF Triples Node Usage Count&quot;</span><span
class="p">);</span>
+
+<span class="c1">// Mapper class</span>
+<span class="c1">// Since the output type is different from the input type have to
declare</span>
+<span class="c1">// our output types</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setMapperClass</span><span
class="p">(</span><span class="n">TripleNodeCountMapper</span><span
class="p">.</span><span class="k">class</span><span class="p">);</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setMapOutputKeyClass</span><span
class="p">(</span><span class="n">NodeWritable</span><span class="p">.</span><span
class="k">class</span><span class="p">);</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setMapOutputValueClass</span><span
class="p">(</span><span class="n">LongWritable</span><span class="p">.</span><span
class="k">class</span><span class="p">);</span>
+
+<span class="c1">// Reducer class</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setReducerClass</span><span
class="p">(</span><span class="n">NodeCountReducer</span><span class="p">.</span><span
class="k">class</span><span class="p">);</span>
+
+<span class="c1">// Input</span>
+<span class="c1">// TriplesInputFormat accepts any RDF triples serialisation</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setInputFormatClass</span><span
class="p">(</span><span class="n">TriplesInputFormat</span><span class="p">.</span><span
class="k">class</span><span class="p">);</span>
+
+<span class="c1">// Output</span>
+<span class="c1">// NTriplesNodeOutputFormat produces lines consisting of a Node formatted</span>
+<span class="c1">// according to the NTriples spec and the value separated by a tab</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setOutputFormatClass</span><span
class="p">(</span><span class="n">NTriplesNodeOutputFormat</span><span
class="p">.</span><span class="k">class</span><span class="p">);</span>
+
+<span class="c1">// Set your input and output paths</span>
+<span class="n">FileInputFormat</span><span class="p">.</span><span
class="n">setInputPath</span><span class="p">(</span><span class="n">job</span><span
class="p">,</span> <span class="k">new</span> <span class="n">Path</span><span
class="p">(</span><span class="s">&quot;/example/input&quot;</span><span
class="p">));</span>
+<span class="n">FileOutputFormat</span><span class="p">.</span><span
class="n">setOutputPath</span><span class="p">(</span><span class="n">job</span><span
class="p">,</span> <span class="k">new</span> <span class="n">Path</span><span
class="p">(</span><span class="s">&quot;/example/output&quot;</span><span
class="p">));</span>
+
+<span class="c1">// Now run the job...</span>
+</pre></div>
   </div>
 </div>
 


