incubator-jena-dev mailing list archives

From Andy Seaborne <>
Subject Re: JENA-44, JENA-45 etc - common Binding I/O
Date Thu, 23 Jun 2011 16:50:31 GMT
Hi Stephen,

On 23/06/11 16:29, Stephen Allen wrote:
> Hi Andy,
> This looks like it will be useful in a lot of places.
> I'm not 100% sure of what you mean by "dynamic choosing of prefixes".  Does that mean
> PREFIX directives are allowed at any point like VARS (so as to only add prefixes as you discover

Choosing them while writing, not requiring them to be given at the start 
of the process of writing.  Making it a requirement to be given a prefix 
mapping means that the system has to have one around at the right point 
but prefix mappings are "just" syntax and often, deep in the system, 
it's not easily available.  And it may be on disk in TDB.

As an example: suppose we decide to only ever write prefixed names.

Then the writer keeps a PrefixMap and, on each row, determines if there is a 
suitable prefix in the usual manner.  If not, it creates one, adds it to the 
internal PrefixMap, outputs a PREFIX ... directive, then outputs the row.

This means that a stream gets prefix name compression without having to 
be given a prefix map.
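
A minimal sketch of that writer loop, in Python for brevity rather than Jena's Java. The class name, the generated prefix labels (p0, p1, ...) and the naive namespace split are all assumptions made for illustration; none of them are Jena APIs. It also falls back to a full <IRI> when no safe split exists, which is looser than "only ever write prefixed names".

```python
import io


class PrefixingRowWriter:
    """Hypothetical writer that invents prefixes on demand while writing rows."""

    def __init__(self, out):
        self.out = out
        self.prefixes = {}  # internal prefix map: namespace IRI -> prefix label

    def _split(self, iri):
        # Naive namespace/local-name split at the last '#' or '/' (assumption).
        cut = max(iri.rfind("#"), iri.rfind("/")) + 1
        return iri[:cut], iri[cut:]

    def _term(self, iri):
        ns, local = self._split(iri)
        if not ns or not local.isalnum():
            return "<" + iri + ">"  # no safe abbreviation; write the IRI in full
        if ns not in self.prefixes:
            # No suitable prefix yet: create one and emit the directive now,
            # before the row that uses it.
            label = "p" + str(len(self.prefixes))
            self.prefixes[ns] = label
            self.out.write("PREFIX " + label + ": <" + ns + "> .\n")
        return self.prefixes[ns] + ":" + local

    def write_row(self, iris):
        terms = [self._term(iri) for iri in iris]  # may emit PREFIX lines first
        self.out.write(" ".join(terms) + " .\n")


buf = io.StringIO()
writer = PrefixingRowWriter(buf)
writer.write_row(["http://example/a", "http://example.other/b"])
writer.write_row(["http://example/c"])
print(buf.getvalue(), end="")
```

The caller never supplies a prefix map; the stream carries its own PREFIX directives, each appearing just before the first row that needs it.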

While we're at it, we might as well design in a BASE directive.

TokenizerText already knows about directives as @directive.  We could 
use that but I don't see much point in punning a form of Turtle and this 
format.  And @directive is too close to a language tag IMO.

> What is the encoding specification for blank nodes?

We certainly need to be able to preserve the label of a blank node :-)

There is already machinery for this, which is why it's not 
specifically mentioned.  It "just happens".

We will be able to preserve blank node labels by using the <_:label> 
form of pseudo URIs.

Or set the LabelToNode mapping to work on _: form 
(LabelToNode.createUseLabelAsGiven) -- either way, we can do both label 
preserving and bindingset-scoped label management.


> -Stephen
> -----Original Message-----
> From: Andy Seaborne []
> Sent: Thursday, June 23, 2011 8:50 AM
> To:
> Subject: JENA-44, JENA-45 etc - common Binding I/O
> == A Design for a Persistent Bindings Mini-language
> There are a number of activities that require being able to serialize,
> and read back, bindings.  They use different serializations.  A shared
> "bindings I/O" would mean all activities could use one, tuned, set of
> serialization and I/O classes.
> JENA-44 (External sort) encodes a binding as a length-denoted byte
> array.  The byte array uses length-denoted byte arrays within the
> bindings.  I/O is done using Data(In|Out)putStream, specifically
> putInt/getInt() and put/get(byte[]) and ByteBuffer putInt/getInt() and
> put/get(byte[]) for the per-row serialization as (var,Turtle string
> form) pairs.  It uses a null for no such value.
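
For illustration only, a length-denoted record of (var, Turtle string form) pairs could be laid out as below. This is a Python sketch, not the JENA-44 code: the 4-byte big-endian lengths, the UTF-8 encoding and the use of length -1 to stand for the null are assumptions.

```python
import struct


def encode_binding(pairs):
    """Encode [(var, term_or_None), ...] as one length-prefixed record."""
    body = b""
    for var, term in pairs:
        for field in (var, term):
            if field is None:
                body += struct.pack(">i", -1)  # assumed marker for "no such value"
            else:
                data = field.encode("utf-8")
                body += struct.pack(">i", len(data)) + data
    return struct.pack(">i", len(body)) + body


def decode_binding(record):
    """Inverse of encode_binding for a single record."""
    (total,) = struct.unpack_from(">i", record, 0)
    pos, end, fields = 4, 4 + total, []
    while pos < end:
        (n,) = struct.unpack_from(">i", record, pos)
        pos += 4
        if n < 0:
            fields.append(None)
        else:
            fields.append(record[pos:pos + n].decode("utf-8"))
            pos += n
    # Fields alternate var, term, var, term, ...
    return list(zip(fields[0::2], fields[1::2]))
```

Note that every record repeats the variable names, which is the overhead the positional VARS scheme in the proposal is meant to avoid.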
> JENA-45 (Spill to disk SPARQL Update) uses a more textual representation
> based on a binding encoded as (var, Turtle term). End of row is denoted
> by a DOT.  It uses modified RIOT for input reading.
> There is also use of TSV I/O for writing and reading result sets.  In
> this form, the variables are written once at the start, and not in each
> line.
> == Proposed mini-language
> This proposal takes those separate designs, and adds high-level compression.
> A sequence of bindings is written assuming there is a list of variables
> in force.  Position in the row determines which variable is bound to
> which term (=>  compression of variable names).  Turtle-style
> prefixes can be used (=>  compression for IRIs) and the value of a slot
> in a row can be "same as the row before" (=>  compression for repeated
> terms) or undefined.
> Rows end in a DOT - this is not strictly necessary but adds robustness
> against truncated data and bugs.
> Every row has the same length, in number of terms, as the list of variables in force.
> Directives are lines starting with a keyword.  End on DOT.
> The directives are:
>     PREFIX : <http://example> .
>     Like Turtle, except keyword based to fit with being a keyword-driven
> mini-language.
>     VARS ?x ?y .
>     Set the variables in force for subsequent rows,
>     until the next VARS directive.
>     We need VARS because it's not always possible to determine all
>     the possible variables before starting to write out bindings.
> A binding row is a sequence of terms, encoded like Turtle, including
> prefixed names and short forms for numbers (more compression).  In
> addition STAR ("*") means "same term as the row before" and DASH ("-")
> means undef.  Don't use * to repeat a - from the previous row.
> Rows end in DOT. Preferred style is one space after each term.  This
> makes writing safe.
> Terms can be written without intermediate copies (except local name
> processing) or buffers.  The OutputLangUtils does not do this currently
> but it should.
> For presentation reasons only, blank lines are allowed (this would all
> get lost in the lexing/tokenization anyway).
> Example:
> -------------
> VARS ?x ?y .
> PREFIX : <http://example/> .
> :local1 <http://example.other/text> .
> * - .
> * 123 .
> -------------
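
To make the row semantics concrete, here is a rough reader sketch in Python (not RIOT, and not a proposed implementation). It assumes one directive or row per line, terms without embedded whitespace (so no quoted literals containing spaces), and that a VARS directive resets what "*" can refer to. Those choices, and the function name, are illustrative assumptions.

```python
import re

# An <IRI> token, or a run of characters containing no whitespace and no '<'.
TOKEN = re.compile(r"<[^>]*>|[^\s<]+")


def read_bindings(text):
    """Decode the mini-language into a list of {var: term} dicts."""
    variables, prefixes, previous, rows = [], {}, [], []
    for line in text.splitlines():
        tokens = TOKEN.findall(line)
        if not tokens:
            continue  # blank lines are allowed
        if tokens[-1] != ".":
            raise ValueError("not terminated by DOT: " + line)
        tokens = tokens[:-1]
        if tokens[0] == "VARS":
            variables = [t.lstrip("?") for t in tokens[1:]]
            previous = []  # assumption: '*' cannot reach back across VARS
            continue
        if tokens[0] == "PREFIX":
            prefixes[tokens[1][:-1]] = tokens[2][1:-1]  # strip ':' and '<' '>'
            continue
        if len(tokens) != len(variables):
            raise ValueError("row length does not match VARS: " + line)
        current = []
        for i, tok in enumerate(tokens):
            if tok == "*":  # same term as the row before
                if i >= len(previous) or previous[i] is None:
                    raise ValueError("'*' has no previous term to repeat")
                term = previous[i]
            elif tok == "-":  # undefined
                term = None
            elif not tok.startswith(('"', "<")) and ":" in tok:
                prefix, local = tok.split(":", 1)
                # Expand known prefixes; leave e.g. _:label untouched.
                term = "<" + prefixes[prefix] + local + ">" if prefix in prefixes else tok
            else:
                term = tok
            current.append(term)
        rows.append({v: t for v, t in zip(variables, current) if t is not None})
        previous = current
    return rows
```

Run on the example above, this yields three bindings: ?x and ?y both bound in the first row, only ?x in the second, and ?x repeated with ?y bound to 123 in the third.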
> == Discussion
> The format is text - but we're writing strings anyway, so a binary form,
> rather than a delimited text form, is unlikely to give much advantage,
> and it can't reuse the standard bytes<->chars stuff without intermediate copies.
> This would all be hidden behind an interface anyway.  A binary tokenizer
> and binary OutputLangUtils would enable binary output.
> Dynamic choosing of prefixes can be done.
