Subject: svn commit: r844519 [2/6] - in /websites/staging/jena/trunk/content: ./ about_jena/ documentation/ documentation/assembler/ documentation/inference/ documentation/io/ documentation/javadoc/ documentation/larq/ documentation/notes/ documentation/ontolog...
Date: Tue, 01 Jan 2013 15:28:12 -0000
To: commits@jena.apache.org
From: buildbot@apache.org

Added: websites/staging/jena/trunk/content/documentation/io/arp_howto.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/io/arp_howto.html (added)
+++ websites/staging/jena/trunk/content/documentation/io/arp_howto.html Tue Jan 1 15:28:10 2013
@@ -0,0 +1,809 @@
Apache Jena - Jena RDF/XML How-To

Jena RDF/XML How-To

This is a guide to the RDF/XML I/O subsystem of Jena, ARP. The first section gives a quick introduction to the I/O subsystem. The other sections are aimed at users wishing to use advanced features within the RDF/XML I/O subsystem.


Quick Introduction

The main I/O methods in Jena use InputStreams and OutputStreams. This is important for handling character sets correctly.

These methods are found on the Model interface. They are:
  • Model read(java.io.InputStream in, java.lang.String base)
    Add statements from an RDF/XML serialization.
  • Model read(java.io.InputStream in, java.lang.String base, java.lang.String lang)
    Add RDF statements represented in language lang to the model.
  • Model read(java.lang.String url)
    Add the RDF statements from an XML document.
  • Model write(java.io.OutputStream out)
    Write the model as an XML document.
  • Model write(java.io.OutputStream out, java.lang.String lang)
    Write a serialized representation of a model in a specified language.
  • Model write(java.io.OutputStream out, java.lang.String lang, java.lang.String base)
    Write a serialized representation of a model in a specified language.

The built-in languages are "RDF/XML", "RDF/XML-ABBREV", "N-TRIPLE" and "TURTLE".

There are also methods which use Readers and Writers. Do not use them unless you are sure it is correct to do so. They are useful in advanced applications (see below), and there is every intention to continue to support them. The RDF/XML parser now checks whether the Model.read(Reader …) calls are being misused, and issues ERR_ENCODING_MISMATCH and WARN_ENCODING_MISMATCH errors. Most incorrect usage of Readers for RDF/XML input will result in such errors. Most incorrect usage of Writers for RDF/XML output will still produce correct XML, because an appropriate XML declaration giving the encoding is written, e.g.

    <?xml version='1.0' encoding='ISO-8859-15'?>

However, such XML is less portable than XML in UTF-8. Using the Model.write(OutputStream …) methods allows the Jena system code to choose UTF-8 encoding, which is the best choice.

RDF/XML, RDF/XML-ABBREV

For input, both of these are the same, and fully implement the RDF Syntax Recommendation; see conformance.

For output, "RDF/XML" produces regular output reasonably efficiently, but it is not readable. In contrast, "RDF/XML-ABBREV" produces readable output without much regard to efficiency.

All the readers and writers for RDF/XML are configurable; see input and output below.

Character Encoding Issues

The easiest way to avoid having to read or understand this section is always to use InputStreams and OutputStreams with Jena, and never to use Readers and Writers. If you do this, Jena will do the right thing for the vast majority of users. If you have legacy code that uses Readers and Writers, or you have special needs with respect to encodings, then this section may be helpful. The last part of this section summarizes the character encodings supported by Jena.

Character encoding is the way that characters are mapped to bytes, shorts or ints. There are many different character encodings. Within Jena, character encodings are important in their relationship to Web content, particularly RDF/XML files, which cannot be understood without knowing the character encoding, and in their relationship to Java, which provides support for many character encodings.

The Java approach to encodings is designed for ease of use on a single machine, which uses a single encoding. This is often a one-byte encoding, e.g. for European languages which do not need thousands of different characters.

The XML approach is designed for the Web, which uses multiple encodings, some of which require thousands of characters.

On the Web, XML files, including RDF/XML files, are by default encoded in "UTF-8" (Unicode). This is always a good choice for creating content, and is the one used by Jena by default. Other encodings can be used, but may be less interoperable. Other encodings should be named using the canonical name registered at IANA, but other systems have no obligation to support any of these, other than UTF-8 and UTF-16.

Within Java, encodings appear primarily with the InputStreamReader and OutputStreamWriter classes, which convert between bytes and characters using a named encoding, and with their subclasses, FileReader and FileWriter, which convert between bytes in the file and characters using the default encoding of the platform. It is not possible to change the encoding used by a Reader or Writer while it is being used. The default encoding of the platform depends on a large range of factors. This default encoding may be useful for communicating with other programs on the same platform. Sometimes the default encoding is not registered at IANA, and so Jena application developers should not use the default encoding for Web content, but use UTF-8.
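To make the byte-level difference concrete, here is a minimal sketch that uses only the standard Java library (no Jena calls). The class name EncodingDemo and the sample string are illustrative assumptions. It shows that the same characters turn into different bytes depending on the encoding named when the OutputStreamWriter is constructed, which is why relying on the platform default is fragile for Web content.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper (not part of Jena): the bytes written depend on the encoding.
final class EncodingDemo {

    /** Encodes text through an OutputStreamWriter that uses the given charset. */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the writer flushes all pending characters into the byte buffer.
        try (Writer w = new OutputStreamWriter(bytes, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // four characters, the last one is e-acute
        System.out.println("UTF-8 bytes:      " + encode(text, StandardCharsets.UTF_8).length);
        System.out.println("ISO-8859-1 bytes: " + encode(text, StandardCharsets.ISO_8859_1).length);
        // The platform default varies between machines, so avoid it for Web content.
        System.out.println("Platform default: " + Charset.defaultCharset().name());
    }
}
```

The e-acute takes two bytes in UTF-8 and one byte in ISO-8859-1, so a consumer that guesses the wrong encoding will misread the file.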

Encodings Supported in Jena 2.2 and later

On RDF/XML input any encoding supported by Java can be used. If this is not a canonical name registered at IANA a warning message is produced. Some encodings have better support in Java 1.5 than Java 1.4; for such encodings a warning message is produced on Java 1.4, suggesting upgrading.

On RDF/XML output any encoding supported by Java can be used, by constructing an OutputStreamWriter using that encoding, and using that for output. If the encoding is not registered at IANA then a warning message is produced. Some encodings have better support in Java 1.5 than Java 1.4; for such encodings a warning message is produced on Java 1.4, suggesting upgrading.

Java can be configured either with or without a jar of extra encodings on the classpath. This jar is charsets.jar and sits in the lib directory of the Java Runtime. If this jar is not on your classpath then the range of encodings supported is fairly small.

The encodings supported by Java are listed by Sun, for 1.4.2 and 1.5.0. For an encoding that is not in these lists it is possible to write your own transcoder as documented in the java.nio.charset package documentation.

Earlier versions of Jena supported fewer encodings.

When to Use Reader and Writer?

Infrequently.

Despite the character encoding issues, it is still sometimes appropriate to use Readers and Writers with Jena I/O. A good example is using Readers and Writers into StringBuffers in memory. These do not need to be encoded and decoded, so a character encoding does not need to be specified. Another example is when an advanced user explicitly wishes to control the encoding.
  • Model read(java.io.Reader reader, java.lang.String base)
    Using this method is often a mistake.
  • Model read(java.io.Reader reader, java.lang.String base, java.lang.String lang)
    Using this method is often a mistake.
  • Model write(java.io.Writer writer)
    Caution! Write the model as an XML document.
  • Model write(java.io.Writer writer, java.lang.String lang)
    Caution! Write a serialized representation of a model in a specified language.
  • Model write(java.io.Writer writer, java.lang.String lang, java.lang.String base)
    Caution! Write a serialized representation of a model in a specified language.

Incorrect use of these read(Reader, …) methods results in warnings and errors with RDF/XML and RDF/XML-ABBREV (except in a few cases where the incorrect use cannot be automatically detected). Incorrect use of the write(Writer, …) methods results in peculiar XML declarations such as <?xml version="1.0" encoding="WINDOWS-1252"?>. This would reflect that the character encoding you used (probably without realizing) in your Writer is registered with IANA under the name "WINDOWS-1252". The resulting XML is of reduced portability as a result. Glenn Marcy notes:

    since UTF-8 and UTF-16 are the only encodings REQUIRED to be understood by all conformant XML processors, even ISO-8859-1 would technically be on shaky ground if not for the fact that it is in such widespread use that every reasonable XML processor supports it.

With N-TRIPLE incorrect use is usually benign, since N-TRIPLE is ASCII based.

Character encoding issues of N3 are not well-defined; hence use of these methods may require changes in the future. Use of the InputStream and OutputStream methods will allow your code to work with future versions of Jena which do the right thing, whatever that is. Currently the OutputStream methods use UTF-8 encoding.
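The "peculiar XML declaration" effect described in this section can be reproduced without Jena. The sketch below is an illustration only and is not Jena's implementation; the class XmlDeclarationSketch and its method are hypothetical. It inspects a Writer and, if it is an OutputStreamWriter, reports the declaration a serializer would need to emit so that the resulting bytes can be read back correctly. A StringWriter never turns characters into bytes, which is why in-memory use is safe.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Jena code: derives an XML declaration from a Writer.
final class XmlDeclarationSketch {

    /** Returns the declaration needed for this Writer, or "" if none is needed. */
    static String declarationFor(Writer writer) {
        if (!(writer instanceof OutputStreamWriter)) {
            return ""; // e.g. StringWriter: no byte encoding is involved
        }
        // getEncoding() may return a historical Java name such as "ISO8859_1";
        // Charset.forName(...).name() maps it to the canonical name.
        String javaName = ((OutputStreamWriter) writer).getEncoding();
        String canonical = Charset.forName(javaName).name();
        if (canonical.equals("UTF-8") || canonical.startsWith("UTF-16")) {
            return ""; // every XML processor must understand these
        }
        return "<?xml version=\"1.0\" encoding=\"" + canonical + "\"?>";
    }

    public static void main(String[] args) {
        Writer latin1 = new OutputStreamWriter(new ByteArrayOutputStream(),
                                               StandardCharsets.ISO_8859_1);
        System.out.println(declarationFor(latin1));
        System.out.println("[" + declarationFor(new StringWriter()) + "]");
    }
}
```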

Introduction to Advanced Jena I/O

The RDF/XML input and output is configurable. However, to configure it, it is necessary to access an RDFReader or RDFWriter object that remains hidden in the simpler interface above.

The four vital calls in the Model interface are:
  • RDFReader getReader()
    Return an RDFReader instance for the default serialization language.
  • RDFReader getReader(java.lang.String lang)
    Return an RDFReader instance for the specified serialization language.
  • RDFWriter getWriter()
    Return an RDFWriter instance for the default serialization language.
  • RDFWriter getWriter(java.lang.String lang)
    Return an RDFWriter instance for the specified serialization language.

Each of these calls returns an RDFReader or RDFWriter that can be used to read or write any Model (not just the one which created it). As well as the necessary read and write methods, these interfaces provide methods for setting properties (setProperty) and for setting the error handler (setErrorHandler).

Setting properties, or the error handler, on an RDFReader or an RDFWriter allows the programmer to access non-default behaviour. Moreover, since an RDFReader or RDFWriter is not bound to a specific Model, a typical idiom is to create the RDFReader or RDFWriter on system initialization, to set the appropriate properties so that it behaves exactly as required in your application, and then to do all subsequent I/O through it.

    Model m = ModelFactory.createDefaultModel();
    RDFWriter writer = m.getWriter();
    m = null; // m is no longer needed.

    // Configure the writer once, at initialization.
    writer.setErrorHandler(myErrorHandler);
    writer.setProperty("showXmlDeclaration", "true");
    writer.setProperty("tab", "8");
    writer.setProperty("relativeURIs", "same-document,relative");
    …
    Model marray[];
    …
    // Reuse the same configured writer for every model.
    for (int i = 0; i < marray.length; i++) {
        …
        OutputStream out = new FileOutputStream("foo" + i + ".rdf");
        writer.write(marray[i], out, "http://example.org/");
        out.close();
    }

Note that all of the current implementations are synchronized, so that a specific RDFReader cannot be reading two different documents at the same time. In a multi-threaded application this may suggest a need for a pool of RDFReaders and/or RDFWriters, or alternatively to create, initialize, use and discard them as needed.
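The pool idea can be sketched with the standard library alone. The class SimplePool below is a hypothetical helper, not part of Jena, and it is deliberately generic: in an application the pooled type would be an RDFReader or RDFWriter that has already been configured, while the demo uses StringBuilder as a stand-in so that the example runs without Jena.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical helper, not part of Jena: a fixed-size pool of preconfigured objects.
final class SimplePool<T> {
    private final BlockingQueue<T> idle;

    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // each instance is created and configured once
        }
    }

    /** Blocks until an instance is free, so no instance is used by two threads at once. */
    T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted while waiting for the pool", e);
        }
    }

    /** Hands an instance back; throws IllegalStateException if the pool is already full. */
    void release(T item) {
        idle.add(item);
    }

    int available() {
        return idle.size();
    }

    public static void main(String[] args) {
        SimplePool<StringBuilder> pool = new SimplePool<>(2, StringBuilder::new);
        StringBuilder sb = pool.borrow();
        try {
            sb.append("work done with a borrowed instance");
        } finally {
            pool.release(sb); // always return the instance, even on failure
        }
        System.out.println("available: " + pool.available());
    }
}
```

Borrowing in a try/finally block keeps the pool from draining when a write fails. The create-use-discard alternative mentioned above needs no helper at all, at the cost of repeating the configuration each time.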

For N-TRIPLE there are currently no properties supported for either the RDFReader or the RDFWriter. Hence the idiom above is not very helpful, and just using the Model.write() methods may prove easier.

For RDF/XML and RDF/XML-ABBREV, there are many options in both the RDFReader and the RDFWriter. N3 has options on the RDFWriter. These options are detailed below. For RDF/XML they are also found in the JavaDoc for JenaReader.setProperty(String, Object) and RDFXMLWriterI.setProperty(String, Object).

Advanced RDF/XML Input

For access to these advanced features, first get an RDFReader object that is an instance of an ARP parser, by using the getReader() method on any Model. It is then configured using the setProperty(String, Object) method. This changes the properties for parsing RDF/XML. Many of the properties change the RDF parser; some change the XML parser. (The Jena RDF/XML parser, ARP, implements the RDF grammar over a Xerces2-J XML parser.) Changing the features and properties of the XML parser is unlikely to be useful, but was easy to implement.

setProperty(String, Object) can be used to set and get:
  • ARP properties
    These allow fine-grained control over the extensive error reporting capabilities of ARP, and are detailed directly below.
  • SAX2 features
    See Xerces features. The value should be given as a String "true" or "false", or as a Boolean.
  • SAX2 properties
    See Xerces properties.
  • Xerces features
    See Xerces features. The value should be given as a String "true" or "false", or as a Boolean.
  • Xerces properties
    See Xerces properties.

ARP properties

An ARP property is referred to either by its property name (see below) or by an absolute URL of the form http://jena.hpl.hp.com/arp/properties/<PropertyName>. The value should be a String, an Integer or a Boolean depending on the property.

ARP property names and string values are case insensitive.
Each entry below gives the property name, its description, the value class and the legal values.

  • iri-rules
    Description: Set the engine for checking and resolving. "strict" sets the IRI engine with rules for valid IRIs, XLink and RDF; it does not permit spaces in IRIs. "iri" sets the IRI engine to IRI (RFC 3986, RFC 3987). The default is "lax" (for backwards compatibility), the rules for RDF URI references only, which does permit spaces although the use of spaces is not good practice.
    Value class: String
    Legal values: lax, strict, iri
  • error-mode
    Description: This allows a coarse-grained approach to control of error handling. Setting this property is equivalent to setting many of the fine-grained error handling properties. Related methods: ARPOptions.setDefaultErrorMode(), ARPOptions.setLaxErrorMode(), ARPOptions.setStrictErrorMode(), ARPOptions.setStrictErrorMode(int).
    Value class: String
    Legal values: default, lax, strict, strict-ignore, strict-warning, strict-error, strict-fatal
  • embedding
    Description: This sets ARP to look for RDF embedded within an enclosing XML document. Related method: ARPOptions.setEmbedding(boolean).
    Value class: String or Boolean
    Legal values: true, false
  • ERR_<XXX>, WARN_<XXX>, IGN_<XXX>
    Description: See ARPErrorNumbers for a complete list of the error conditions detected. Setting one of these properties is equivalent to the method ARPOptions.setErrorMode(int, int). Thus fine-grained control over the behaviour in response to specific error conditions is possible.
    Value class: String or Integer
    Legal values: EM_IGNORE, EM_WARNING, EM_ERROR, EM_FATAL

As an example, if you are working in an environment with legacy RDF data that uses unqualified RDF attributes such as "about" instead of "rdf:about", then the following code is appropriate:

    Model m = ModelFactory.createDefaultModel();
    RDFReader arp = m.getReader();
    // The reader is not bound to m; it can read into any Model.

    // initialize arp
    // Do not warn on use of unqualified RDF attributes.
    arp.setProperty("WARN_UNQUALIFIED_RDF_ATTRIBUTE", "EM_IGNORE");

    …

    InputStream in = new FileInputStream(fname);
    arp.read(m, in, url);
    in.close();

As a second example, suppose you wish to work in strict mode, but allow "daml:collection". The following works:

    …
    arp.setProperty("error-mode", "strict");
    arp.setProperty("IGN_DAML_COLLECTION", "EM_IGNORE");
    …

The other way round does not work:

    …
    arp.setProperty("IGN_DAML_COLLECTION", "EM_IGNORE");
    arp.setProperty("error-mode", "strict");
    …

This is because in strict mode IGN_DAML_COLLECTION is treated as an error, and so the second call to setProperty overwrites the effect of the first.

The IRI rules and resolver can be set on a per-reader basis:

    InputStream in = ... ;
    String baseURI = ... ;
    Model model = ModelFactory.createDefaultModel();
    RDFReader r = model.getReader("RDF/XML");
    r.setProperty("iri-rules", "strict");
    r.setProperty("error-mode", "strict"); // Warnings become errors.

    // Alternative to the above "error-mode": set a specific warning to be an error.
    //r.setProperty("WARN_MALFORMED_URI", ARPErrorNumbers.EM_ERROR);
    r.read(model, in, baseURI);
    in.close();

The global default IRI engine can be set with:

    ARPOptions.setIRIFactoryGlobal(IRIFactory.iriImplementation());

or with another IRI rule engine from IRIFactory.

Interrupting ARP

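ARP can be interrupted using the Thread.interrupt() method. This causes an ERR_INTERRUPTED error during the parse, which is usually treated as a fatal error.

Here is an illustrative code sample:

    // tim (delay in milliseconds), in and fileName are defined elsewhere.
    ARP a = new ARP();
    // The thread that runs the parse; this is the one to interrupt.
    final Thread arpt = Thread.currentThread();
    // A second thread interrupts the parse after tim milliseconds.
    Thread killt = new Thread(new Runnable() {
        public void run() {
            try {
                Thread.sleep(tim);
            } catch (InterruptedException e) {
                // ignore, and interrupt the parser straight away
            }
            arpt.interrupt();
        }
    });
    killt.start();
    try {
        in = new FileInputStream(fileName);
        a.load(in);
        in.close();
        // JUnit-style check: reaching this line means no interrupt arrived.
        fail("Thread was not interrupted.");
    } catch (SAXParseException e) {
        // expected: the interrupt surfaces as a fatal parse error
    }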

Advanced RDF/XML Output

The first RDF/XML output question is whether to use the "RDF/XML" or the "RDF/XML-ABBREV" writer. While some of the code is shared, these two writers are really very different, resulting in different but equivalent output. RDF/XML-ABBREV is slower, but should produce more readable XML.

For access to advanced features, first get an RDFWriter object, of the appropriate language, by using getWriter("RDF/XML") or getWriter("RDF/XML-ABBREV") on any Model. It is then configured using the setProperty(String, Object) method. This changes the properties for writing RDF/XML.

Properties to Control RDF/XML Output

Each entry below gives the property name, its description, the value class and the legal values.

  • xmlbase
    Description: The value to be included for an xml:base attribute on the root element in the file.
    Value class: String
    Legal values: A URI string, or null (default)
  • longId
    Description: Whether to use long or short IDs for anonymous resources. Short IDs are easier to read and are the default, but can run out of memory on very large models.
    Value class: String or Boolean
    Legal values: "true", "false" (default)
  • allowBadURIs
    Description: URIs in the graph are, by default, checked prior to serialization.
    Value class: String or Boolean
    Legal values: "true", "false" (default)
  • relativeURIs
    Description: What sort of relative URIs should be used. A comma separated list of options:
      • same-document: same-document references (e.g. "" or "#foo")
      • network: network paths, e.g. "//example.org/foo", omitting the URI scheme
      • absolute: absolute paths, e.g. "/foo", omitting the scheme and authority
      • relative: relative path not beginning in "../"
      • parent: relative path beginning in "../"
      • grandparent: relative path beginning in "../../"
    The default value is "same-document, absolute, relative, parent". To switch off relative URIs use the value "". Relative URIs of any of these types are output where possible if and only if the option has been specified.
    Value class: String
  • showXmlDeclaration
    Description: If true, an XML declaration is included in the output; if false, no XML declaration is included. The default behaviour only gives an XML declaration when asked to write to an OutputStreamWriter that uses some encoding other than UTF-8 or UTF-16. In this case the encoding is shown in the XML declaration. To ensure that the encoding attribute is shown in the XML declaration either:
      • set this option to true and use the write(Model,Writer,String) variant with an appropriate OutputStreamWriter; or
      • set this option to false, and write the declaration to an OutputStream before calling write(Model,OutputStream,String).
    Value class: true, "true", false, "false" or "default"
    Legal values: can be true, false or "default" (null)
  • showDoctypeDeclaration
    Description: If true, an XML Doctype declaration is included in the output. This declaration includes a !ENTITY declaration for each prefix mapping in the model, and any attribute value that starts with the URI of that mapping is written as starting with the corresponding entity invocation.
    Value class: String or Boolean
    Legal values: true, false, "true", "false"
  • tab
    Description: The number of spaces with which to indent XML child elements.
    Value class: String or Integer
    Legal values: positive integer; "2" is the default
  • attributeQuoteChar
    Description: How to write XML attributes.
    Value class: String
    Legal values: "\"" or "'"
  • blockRules
    Description: A list of Resource, or a String being a comma separated list of fragment IDs from http://www.w3.org/TR/rdf-syntax-grammar, indicating grammar rules that will not be used. Rules that can be blocked are:
      • section-Reification (RDFSyntax.sectionReification)
      • section-List-Expand (RDFSyntax.sectionListExpand)
      • parseTypeLiteralPropertyElt (RDFSyntax.parseTypeLiteralPropertyElt)
      • parseTypeResourcePropertyElt (RDFSyntax.parseTypeResourcePropertyElt)
      • parseTypeCollectionPropertyElt (RDFSyntax.parseTypeCollectionPropertyElt)
      • idAttr (RDFSyntax.idAttr)
      • propertyAttr (RDFSyntax.propertyAttr)
    In addition "daml:collection" (DAML_OIL.collection) can be blocked. Blocking idAttr also blocks section-Reification. By default, rule propertyAttr is blocked. For the basic writer (RDF/XML) only parseTypeLiteralPropertyElt has any effect, since none of the other rules are implemented by that writer.
    Value class: Resource[] or String
  • prettyTypes
    Description: Only for the RDF/XML-ABBREV writer. This is a list of the types of the principal objects in the model. The writer will tend to create RDF/XML with resources of these types at the top level.
    Value class: Resource[]
As an example,

    RDFWriter w = m.getWriter("RDF/XML-ABBREV");
    w.setProperty("attributeQuoteChar", "'");
    w.setProperty("showXmlDeclaration", "true");
    w.setProperty("tab", "1");
    w.setProperty("blockRules",
        "daml:collection,parseTypeLiteralPropertyElt,"
        + "parseTypeResourcePropertyElt,parseTypeCollectionPropertyElt");

creates a writer that does not use rdf:parseType (preferring rdf:datatype for rdf:XMLLiteral), indents only a little, and produces the XML declaration. Attributes are used, and are quoted with "'".

Note that property attributes are not used at all, by default. However, the RDF/XML-ABBREV writer includes a rule to produce property attributes when the value does not contain any spaces. This rule is normally switched off. This rule can be turned on selectively by using the blockRules property as detailed above.
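The six kinds of relative reference named by the relativeURIs option are easier to tell apart when seen resolved against a base. The sketch below uses only java.net.URI and is not Jena code; the class name RelativeUriForms and the base URI are illustrative assumptions. It shows what a consumer would reconstruct from each form, which is the property the writer has to preserve when it abbreviates a URI.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: resolves one example of each relative form against a base URI.
final class RelativeUriForms {

    /** Maps each relative reference to the absolute URI it denotes under the given base. */
    static Map<String, String> resolveAll(String base) {
        URI baseUri = URI.create(base);
        String[] forms = {
            "#foo",              // same-document
            "//example.org/foo", // network: scheme omitted
            "/foo",              // absolute: scheme and authority omitted
            "bar",               // relative
            "../bar",            // parent
            "../../bar"          // grandparent
        };
        Map<String, String> resolved = new LinkedHashMap<>();
        for (String form : forms) {
            resolved.put(form, baseUri.resolve(form).toString());
        }
        return resolved;
    }

    public static void main(String[] args) {
        resolveAll("http://example.org/a/b/c/doc.rdf")
            .forEach((form, uri) -> System.out.println(form + " -> " + uri));
    }
}
```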

Conformance

The RDF/XML I/O endeavours to conform with the RDF Syntax Recommendation.

For conformance, the parser must be set to strict mode. (Note that the conformant behaviour for rdf:parseType="daml:collection" is to silently turn "daml:collection" into "Literal".)

The RDF/XML writer is conformant, but does not exercise much of the grammar.

The RDF/XML-ABBREV writer exercises all of the grammar and is conformant except that it uses the daml:collection construct for DAML ontologies. This non-conformant behaviour can be switched off using the blockRules property.

Faster RDF/XML I/O

To optimise the speed of writing RDF/XML it is suggested that all URI processing is turned off. Also do not use RDF/XML-ABBREV. It is unclear whether the longId property is faster or slower: the short IDs have to be generated on the fly and a table maintained during writing, whereas the longer IDs are long, and hence take longer to write. The following creates a faster writer:

    Model m;
    …
    …
    RDFWriter fasterWriter = m.getWriter("RDF/XML");
    fasterWriter.setProperty("allowBadURIs", "true"); // skip URI checking
    fasterWriter.setProperty("relativeURIs", "");     // no relative URIs
    fasterWriter.setProperty("tab", "0");             // no indentation

When reading RDF/XML, the check for reuse of rdf:ID has a memory overhead, which can be significant for very large files. In this case, the check can be suppressed by telling ARP to ignore this error.

    Model m;
    …
    …
    RDFReader bigFileReader = m.getReader("RDF/XML");
    bigFileReader.setProperty("WARN_REDEFINITION_OF_ID", "EM_IGNORE");
    …

Added: websites/staging/jena/trunk/content/documentation/io/arp_sax.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/io/arp_sax.html (added)
+++ websites/staging/jena/trunk/content/documentation/io/arp_sax.html Tue Jan 1 15:28:10 2013
@@ -0,0 +1,349 @@
Apache Jena - SAX Input into Jena and ARP

SAX Input into Jena and ARP

Normally, both ARP and Jena are used to read files either from the local machine or from the Web. A different use case, addressed here, is when the XML source is available in-memory in some way. In these cases, ARP and Jena can be used as a SAX event handler, turning SAX events into triples, or a DOM tree can be parsed into a Jena Model.

1. Overview

To read an arbitrary SAX source as triples to be added into a Jena model, it is not possible to use a Model.read() operation. Instead, you construct a SAX event handler of class SAX2Model, using the create method, install it as the handler on your SAX event source, and then stream the SAX events. It is possible to have fine-grained control over the SAX events, for instance by inserting or deleting events, before passing them to the SAX2Model handler.
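One standard way to delete events before they reach a handler is a SAX filter. The sketch below uses only the JDK's SAX classes; the class RdfSubsetFilter and its demo method are hypothetical and not part of Jena. It forwards element and character events only while inside an rdf:RDF element and drops everything else. In the demo a recording DefaultHandler stands in for the SAX2Model handler so that the example runs without Jena.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: passes on only the events that occur inside rdf:RDF.
final class RdfSubsetFilter extends XMLFilterImpl {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    private int depth = 0; // greater than 0 while inside an rdf:RDF element

    RdfSubsetFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || (RDF_NS.equals(uri) && "RDF".equals(local))) {
            depth++;
            super.startElement(uri, local, qName, atts); // forward the event
        } // otherwise the event is deleted
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (depth > 0) {
            depth--;
            super.endElement(uri, local, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth > 0) {
            super.characters(ch, start, length);
        }
    }

    /** Demo: returns the local names of the elements that survive the filter. */
    static List<String> elementsSeen(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            RdfSubsetFilter filter = new RdfSubsetFilter(factory.newSAXParser().getXMLReader());
            final List<String> seen = new ArrayList<>();
            // Stand-in for the real handler; it only records element names.
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes a) {
                    seen.add(local);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
            return seen;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Namespace prefix events are left untouched by this filter, so declarations made on enclosing elements still reach the handler.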

Sample Code

This code uses the Xerces parser as a SAX event stream, and adds the triples to a Model using default options.

    // Use your own SAX source. Here it is the Xerces SAXParser.
    XMLReader saxParser = new SAXParser();

    // The base URI must be known before the input source is set up.
    String base = "http://example.org/";

    // set up SAX input
    InputStream in = new FileInputStream("kb.rdf");
    InputSource ins = new InputSource(in);
    ins.setSystemId(base);

    Model m = ModelFactory.createDefaultModel();

    // create handler, linked to Model
    SAX2Model handler = SAX2Model.create(base, m);

    // install handler on SAX event stream
    SAX2RDF.installHandlers(saxParser, handler);

    try {
        try {
            saxParser.parse(ins);
        } finally {
            // MUST ensure handler is closed.
            handler.close();
        }
    } catch (SAXParseException e) {
        // Fatal parsing errors end here,
        // but they will already have been reported.
    }

Initializing SAX event source

If your SAX event source is an XMLReader, then the installHandlers static method can be used as shown in the sample. Otherwise, you have to do it yourself. The installHandlers code is like this:

    static public void installHandlers(XMLReader rdr, XMLHandler sax2rdf)
            throws SAXException {
        rdr.setEntityResolver(sax2rdf);
        rdr.setDTDHandler(sax2rdf);
        rdr.setContentHandler(sax2rdf);
        rdr.setErrorHandler(sax2rdf);
        rdr.setFeature("http://xml.org/sax/features/namespaces", true);
        rdr.setFeature(
                "http://xml.org/sax/features/namespace-prefixes",
                true);
        rdr.setProperty(
                "http://xml.org/sax/properties/lexical-handler",
                sax2rdf);
    }

For some other SAX source, the exact code will differ, but the required operations are as above.

Error Handler

The SAX2Model handler supports the setErrorHandler method, from the Jena RDFReader interface. This is used in the same way as that method to control error reporting.

A specific fatal error, new in Jena 2.3, is ERR_INTERRUPTED, which indicates that the current Thread received an interrupt. This allows long jobs to be aborted on user request.

Options

The SAX2Model handler supports the setProperty method, from the Jena RDFReader interface. This is used in nearly the same way to have fine-grained control over ARP's behaviour, particularly over error reporting; see the I/O howto. Setting SAX or Xerces properties cannot be done using this method.

XML Lang and Namespaces

If you are only treating some document subset as RDF/XML then it is necessary to ensure that ARP knows the correct value for xml:lang, and desirable that it knows the correct mappings of namespace prefixes.

There is a second version of the create method, which allows specification of the xml:lang value from the outer context. If this is inappropriate it is possible, but hard work, to synthesize an appropriate SAX event.

For the namespace prefixes, it is possible to fire the startPrefixMapping SAX event, before passing the other SAX events, to declare each namespace, one by one. Failure to do this is permitted, but, for instance, a Jena Model will then not know the (advisory) namespace prefix bindings. These should be paired with endPrefixMapping events, but nothing untoward is likely if such code is omitted.
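Declaring the outer namespaces by hand amounts to a loop over the bindings. The following sketch uses only the JDK's SAX interfaces; the class OuterContext, its method names and the example bindings are hypothetical. A recording DefaultHandler stands in for the SAX2Model handler so that the example runs without Jena.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: announces namespace bindings from the enclosing document.
final class OuterContext {

    /** Fires one startPrefixMapping event per binding, before any other event. */
    static void declare(ContentHandler handler, Map<String, String> bindings)
            throws SAXException {
        for (Map.Entry<String, String> b : bindings.entrySet()) {
            handler.startPrefixMapping(b.getKey(), b.getValue());
        }
    }

    /** Fires the matching endPrefixMapping events once the subset has been streamed. */
    static void undeclare(ContentHandler handler, Map<String, String> bindings)
            throws SAXException {
        for (String prefix : bindings.keySet()) {
            handler.endPrefixMapping(prefix);
        }
    }

    /** Demo: returns the events a handler receives around an (empty) subset. */
    static List<String> demo() {
        final List<String> events = new ArrayList<>();
        // Stand-in for the real handler; it only records the prefix events.
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startPrefixMapping(String prefix, String uri) {
                events.add("start " + prefix + "=" + uri);
            }
            @Override
            public void endPrefixMapping(String prefix) {
                events.add("end " + prefix);
            }
        };
        Map<String, String> outer = new LinkedHashMap<>();
        outer.put("dc", "http://purl.org/dc/elements/1.1/");
        outer.put("ex", "http://example.org/ns#");
        try {
            declare(recorder, outer);
            // ... the SAX events for the RDF/XML subset would be passed on here ...
            undeclare(recorder, outer);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return events;
    }
}
```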

Using your own triple handler

As with ARP, it is possible to use this functionality without using other Jena features, in particular without using a Jena Model. Instead of using the class SAX2Model, you use its superclass SAX2RDF. The create method on this class does not provide any means of specifying what to do with the triples. Instead, the class implements the ARPConfig interface, which permits the setting of handlers and parser options, as described in the documentation for using ARP without Jena.

Thus you need to:

  1. Create a SAX2RDF using SAX2RDF.create().
  2. Attach your StatementHandler and SAXErrorHandler, and optionally your NamespaceHandler and ExtendedHandler, to the SAX2RDF instance.
  3. Install the SAX2RDF instance as the SAX handler on your SAX source.
  4. Follow the remainder of the code sample above.

Using a DOM as Input

None of the approaches listed here work with Java 1.4.1_04. We suggest using Java 1.4.2_04 or greater for this functionality. This issue has no impact on any other Jena functionality.

Using a DOM as Input to Jena

The DOM2Model subclass of SAX2Model allows the parsing of a DOM using ARP. The procedure to follow is:

  • Construct a DOM2Model, using a factory method such as createD2M, specifying the xml:base of the document to be loaded, the Model to load into, and optionally the xml:lang value (particularly useful if using a DOM Node from within a Document).
  • Set any properties, error handlers etc. on the DOM2Model object.
  • Parse the DOM by calling the load(Node) method.
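The procedure above starts from a DOM Node that already exists. As a minimal sketch of how such a Node might be obtained with the JDK alone, the hypothetical class DomSource below parses a string into a Document. Turning on namespace awareness is an assumption made here, on the grounds that RDF/XML processing depends on namespace information being present on the nodes; the Jena steps themselves (createD2M, setting properties, load(Node)) are only indicated in a comment because they need Jena on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: builds the DOM that would later be handed to DOM2Model.
final class DomSource {

    /** Parses XML text into a namespace-aware DOM Document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: namespace URIs and local names are needed on the nodes.
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>");
        System.out.println(doc.getDocumentElement().getNamespaceURI());
        // With Jena available, the steps listed above would follow here:
        // construct a DOM2Model with createD2M, configure it, then call load(doc).
    }
}
```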

Using a DOM as Input to ARP

DOM2Model is a subclass of SAX2RDF, and handlers etc. can be set on the DOM2Model as for SAX2RDF. Using a null model as the argument to the factory indicates this usage.