jena-commits mailing list archives

From build...@apache.org
Subject svn commit: r931174 - in /websites/staging/jena/trunk/content: ./ documentation/hadoop/io.html
Date Mon, 01 Dec 2014 14:51:35 GMT
Author: buildbot
Date: Mon Dec  1 14:51:35 2014
New Revision: 931174

Log:
Staging update by buildbot for jena

Modified:
    websites/staging/jena/trunk/content/   (props changed)
    websites/staging/jena/trunk/content/documentation/hadoop/io.html

Propchange: websites/staging/jena/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Dec  1 14:51:35 2014
@@ -1 +1 @@
-1642168
+1642695

Modified: websites/staging/jena/trunk/content/documentation/hadoop/io.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/hadoop/io.html (original)
+++ websites/staging/jena/trunk/content/documentation/hadoop/io.html Mon Dec  1 14:51:35 2014
@@ -157,7 +157,10 @@
 </ul>
 </li>
 <li><a href="#output">Output</a><ul>
-<li><a href="#blank-nodes-in-output">Blank Nodes in Output</a></li>
+<li><a href="#blank-nodes-in-output">Blank Nodes in Output</a><ul>
+<li><a href="#blank-node-divergence-in-multi-stage-pipelines">Blank Node Divergence in Multi-Stage Pipelines</a></li>
+</ul>
+</li>
 <li><a href="#node-output-format">Node Output Format</a></li>
 </ul>
 </li>
@@ -166,16 +169,20 @@
 <li><a href="#output_1">Output</a></li>
 </ul>
 </li>
-<li><a href="#configuration-options">Configuration Options</a><ul>
+<li><a href="#job-setup">Job Setup</a><ul>
+<li><a href="#job-configuration-options">Job Configuration Options</a><ul>
 <li><a href="#input-lines-per-batch">Input Lines per Batch</a></li>
 <li><a href="#max-line-length">Max Line Length</a></li>
 <li><a href="#ignoring-bad-tuples">Ignoring Bad Tuples</a></li>
+<li><a href="#global-blank-node-identity">Global Blank Node Identity</a></li>
 <li><a href="#output-batch-size">Output Batch Size</a></li>
 </ul>
 </li>
 </ul>
 </li>
 </ul>
+</li>
+</ul>
 </div>
 <h1 id="background-on-hadoop-io">Background on Hadoop IO</h1>
 <p>If you are already familiar with the Hadoop IO paradigm then please skip this section; if not, please read on as otherwise some of the later information will not make much sense.</p>
@@ -184,8 +191,9 @@
 <p>In some cases there are file formats that may be processed in multiple ways, i.e. you can <em>split</em> them into pieces or you can process them as a whole.  Which approach you wish to use will depend on whether you have a single file to process or many.  When there are many files, processing each file as a whole may provide better overall throughput than processing them as chunks.  However your mileage may vary, especially if your input data has many files of uneven size.</p>
 <h2 id="compressed-io">Compressed IO</h2>
 <p>Hadoop natively provides support for compressed input and output provided your Hadoop cluster is appropriately configured.  The advantage of compressing the input/output data is that there is less IO workload on the cluster; however, this comes with the disadvantage that most compression formats block Hadoop's ability to <em>split</em> up the input.</p>
+<p>Hadoop generally handles compression automatically, and all our input and output formats are capable of handling compressed input and output as necessary.</p>
 <h1 id="rdf-io-in-hadoop">RDF IO in Hadoop</h1>
-<p>There are a wide range of RDF serialisations supported by ARQ, please see the <a
href="../io/">RDF IO</a> for an overview of the formats that Jena supports.</p>
+<p>There is a wide range of RDF serialisations supported by ARQ; please see the <a href="../io/">RDF IO</a> page for an overview of the formats that Jena supports.  In this section we go into much more depth on exactly how we support RDF IO in Hadoop.</p>
 <h2 id="input">Input</h2>
 <p>One of the difficulties posed when wrapping these serialisations for Hadoop IO is that the formats have very different properties in terms of our ability to <em>split</em> them into distinct chunks for Hadoop to process.  We therefore categorise the possible ways to process RDF inputs as follows:</p>
 <ol>
@@ -213,6 +221,10 @@
 <p>As with input, blank nodes are a complicating factor in producing RDF output.  For whole file output formats this is not an issue, but it does need to be considered for line and batch based formats.</p>
 <p>However, what we have found in practice is that the Jena writers predictably map internal identifiers to blank node identifiers in the output serialisations.  This means that even when processing output in batches, the line/batch based formats correctly preserve blank node identity.</p>
 <p>If you are concerned about potential data corruption as a result of this, you should always choose a whole file output format, but be aware that this can exhaust memory if your output is large.</p>
+<h4 id="blank-node-divergence-in-multi-stage-pipelines">Blank Node Divergence in Multi-Stage Pipelines</h4>
+<p>The other thing to consider with regard to blank nodes in output is that Hadoop will by default create multiple output files (one for each reducer), so even if consistent and valid blank nodes are output they may be spread over multiple files.</p>
+<p>In multi-stage pipelines you will need to manually concatenate these files back together (assuming they are in a format that allows this, e.g. NTriples), as otherwise when you pass them as input to the next job the blank node identifiers will diverge from each other.  <a href="https://issues.apache.org/jira/browse/JENA-820">JENA-820</a> discusses this problem and introduces a special configuration setting that can be used to resolve it.  Note that even with this setting enabled some formats are not capable of respecting it; see the later section on <a href="#job-configuration-options">Job Configuration Options</a> for more details.</p>
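As an illustration, the concatenation step between stages can be done with Hadoop's `FileUtil.copyMerge` helper, which merges all the part files in a directory into a single file.  This is only a sketch: `copyMerge` exists in Hadoop 1.x/2.x but was removed in Hadoop 3, and the paths used here are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeStageOutput {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration(true);
        FileSystem fs = FileSystem.get(config);

        // Merge the per-reducer part files from the previous stage into a
        // single NTriples file so the next job sees one consistent input
        // (paths are hypothetical)
        Path previousOutput = new Path("/users/example/stage1-output");
        Path mergedInput = new Path("/users/example/stage2-input/merged.nt");
        FileUtil.copyMerge(fs, previousOutput, fs, mergedInput, false, config, null);
    }
}
```

On Hadoop 3 you would instead concatenate the part files yourself, e.g. by streaming them from the FileSystem API in order.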
+<p>An alternative workaround is to always use RDF Thrift as the intermediate output format since it preserves blank node identifiers precisely as they are seen.  This also has the advantage that RDF Thrift is extremely fast to read and write, which can speed up multi-stage pipelines considerably.</p>
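To make that concrete, two pipeline stages might be wired up as follows.  This is a sketch only: the `ThriftTripleInputFormat` and `ThriftTripleOutputFormat` class names and the paths are assumptions, so check the Javadocs for the exact classes in your version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// The Thrift input/output format classes come from the RDF IO modules;
// their names and package are assumed here

public class ThriftPipelineStages {
    public static void main(String[] args) throws Exception {
        // First stage writes its intermediate output as RDF Thrift
        Job stage1 = Job.getInstance(new Configuration(true));
        stage1.setOutputFormatClass(ThriftTripleOutputFormat.class);
        FileOutputFormat.setOutputPath(stage1, new Path("/users/example/intermediate"));

        // Second stage reads the same directory back as RDF Thrift, so
        // blank node identifiers are preserved exactly across the boundary
        Job stage2 = Job.getInstance(new Configuration(true));
        stage2.setInputFormatClass(ThriftTripleInputFormat.class);
        FileInputFormat.setInputPaths(stage2, "/users/example/intermediate");
    }
}
```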
 <h3 id="node-output-format">Node Output Format</h3>
 <p>We also include a special <code>NTriplesNodeOutputFormat</code> which
is capable of outputting pairs composed of a <code>NodeWritable</code> key and
any value type.  Think of this as being similar to the standard Hadoop <code>TextOutputFormat</code>
except it understands how to format nodes as valid NTriples serialisation.  This format is
useful when performing simple statistical analysis such as node usage counts or other calculations
over nodes.</p>
 <p>In the case where the value of the key value pair is also an RDF primitive, proper NTriples formatting is also applied to each of the nodes in the value.</p>
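As a sketch of how this might be used, the following wires <code>NTriplesNodeOutputFormat</code> into a hypothetical node usage counting job.  The key/value classes follow the description above; the output path is made up for illustration, and the imports for the RDF IO classes are assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// NodeWritable and NTriplesNodeOutputFormat come from the RDF IO modules;
// their package is assumed here

public class NodeCountJobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(true));

        // Keys are RDF nodes, values are usage counts; the output format
        // writes each node key as valid NTriples serialisation
        job.setOutputKeyClass(NodeWritable.class);
        job.setOutputValueClass(LongWritable.class);
        job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/users/example/node-counts"));
    }
}
```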
@@ -275,9 +287,28 @@
   <tr><td>RDF Thrift</td><td>Yes</td><td>No</td><td>No</td></tr>
 </table>
 
-<h2 id="configuration-options">Configuration Options</h2>
+<h2 id="job-setup">Job Setup</h2>
+<p>To use RDF as an input and/or output format you will need to configure your Job appropriately; this requires setting the input/output formats and the data paths:</p>
+<div class="codehilite"><pre><span class="c1">// Create a job using default
configuration</span>
+<span class="n">Job</span> <span class="n">job</span> <span class="o">=</span> <span class="n">Job</span><span class="p">.</span><span class="n">getInstance</span><span class="p">(</span><span class="k">new</span> <span class="n">Configuration</span><span class="p">(</span><span class="n">true</span><span class="p">));</span>
+
+<span class="c1">// Use Turtle as the input format</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setInputFormatClass</span><span
class="p">(</span><span class="n">TurtleInputFormat</span><span class="p">.</span><span
class="k">class</span><span class="p">);</span>
+<span class="n">FileInputFormat</span><span class="p">.</span><span class="n">setInputPaths</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="s">&quot;/users/example/input&quot;</span><span class="p">);</span>
+
+<span class="c1">// Use NTriples as the output format</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setOutputFormatClass</span><span
class="p">(</span><span class="n">NTriplesOutputFormat</span><span
class="p">.</span><span class="k">class</span><span class="p">);</span>
+<span class="n">FileOutputFormat</span><span class="p">.</span><span class="n">setOutputPath</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="k">new</span> <span class="n">Path</span><span class="p">(</span><span class="s">&quot;/users/example/output&quot;</span><span class="p">));</span>
+
+<span class="c1">// Other job configuration...</span>
+</pre></div>
+
+
+<p>This example reads input in Turtle format from the directory <code>/users/example/input</code> and writes the end results in NTriples to the directory <code>/users/example/output</code>.</p>
+<p>Take a look at the <a href="../javadoc/hadoop/io/">Javadocs</a> to see the available input and output format implementations.</p>
+<h3 id="job-configuration-options">Job Configuration Options</h3>
 <p>There are several useful configuration options that can be used to tweak the behaviour of the RDF IO functionality if desired.</p>
-<h3 id="input-lines-per-batch">Input Lines per Batch</h3>
+<h4 id="input-lines-per-batch">Input Lines per Batch</h4>
 <p>Since our line based input formats use the standard Hadoop <code>NLineInputFormat</code>
to decide how to split up inputs we support the standard <code>mapreduce.input.lineinputformat.linespermap</code>
configuration setting for changing the number of lines processed per map.</p>
 <p>You can set this directly in your configuration:</p>
 <div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span
class="n">getConfiguration</span><span class="p">().</span><span class="n">setInt</span><span
class="p">(</span><span class="n">NLineInputFormat</span><span class="p">.</span><span
class="n">LINES_PER_MAP</span><span class="p">,</span> 100<span class="p">);</span>
@@ -289,19 +320,28 @@
 </pre></div>
 
 
-<h3 id="max-line-length">Max Line Length</h3>
+<h4 id="max-line-length">Max Line Length</h4>
 <p>When using line based inputs it may be desirable to ignore lines that exceed a certain
length (for example if you are not interested in really long literals).  Again we use the
standard Hadoop configuration setting <code>mapreduce.input.linerecordreader.line.maxlength</code>
to control this behaviour:</p>
 <div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span
class="n">getConfiguration</span><span class="p">().</span><span class="n">setInt</span><span
class="p">(</span><span class="n">HadoopIOConstants</span><span class="p">.</span><span
class="n">MAX_LINE_LENGTH</span><span class="p">,</span> 8192<span
class="p">);</span>
 </pre></div>
 
 
-<h3 id="ignoring-bad-tuples">Ignoring Bad Tuples</h3>
+<h4 id="ignoring-bad-tuples">Ignoring Bad Tuples</h4>
 <p>In many cases you may have data that you know contains invalid tuples; in such cases it can be useful to simply ignore the bad tuples and continue.  By default we enable this behaviour and skip over bad tuples, though they will be logged as errors.  You can disable this behaviour by setting the <code>rdf.io.input.ignore-bad-tuples</code> configuration setting:</p>
 <div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span
class="n">getConfiguration</span><span class="p">().</span><span class="n">setBoolean</span><span
class="p">(</span><span class="n">RdfIOConstants</span><span class="p">.</span><span
class="n">INPUT_IGNORE_BAD_TUPLES</span><span class="p">,</span> <span
class="n">false</span><span class="p">);</span>
 </pre></div>
 
 
-<h3 id="output-batch-size">Output Batch Size</h3>
+<h4 id="global-blank-node-identity">Global Blank Node Identity</h4>
+<p>The default behaviour of these libraries is to allocate file-scoped blank node identifiers in such a way that the same syntactic identifier read from the same file (even if by different nodes/processes) is allocated the same blank node ID, while the same syntactic identifier in different files results in different blank nodes.  However, as discussed earlier, in the case of multi-stage jobs the intermediate outputs may be split over several files, which can cause the blank node identifiers to diverge from each other when they are read back in.</p>
+<p>For multi-stage jobs this is often (but not always) incorrect and undesirable behaviour, in which case you can set the <code>rdf.io.input.bnodes.global-identity</code> property to true:</p>
+<div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">().</span><span class="n">setBoolean</span><span class="p">(</span><span class="n">RdfIOConstants</span><span class="p">.</span><span class="n">GLOBAL_BNODE_IDENTITY</span><span class="p">,</span> <span class="n">true</span><span class="p">);</span>
+</pre></div>
+
+
+<p>Note however that not all formats are capable of honouring this option, notably
RDF/XML and JSON-LD.</p>
+<p>As noted earlier an alternative workaround is to use RDF Thrift as the intermediate
format since it guarantees to preserve blank node identifiers precisely.</p>
+<h4 id="output-batch-size">Output Batch Size</h4>
 <p>The batch size for batched output formats can be controlled by setting the <code>rdf.io.output.batch-size</code>
property as desired.  The default value for this if not explicitly configured is 10,000:</p>
 <div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">().</span><span class="n">setInt</span><span class="p">(</span><span class="n">RdfIOConstants</span><span class="p">.</span><span class="n">OUTPUT_BATCH_SIZE</span><span class="p">,</span> 25000<span class="p">);</span>
 </pre></div>


