incubator-jena-dev mailing list archives

From Andy Seaborne <>
Subject JENA-44, JENA-45 etc - common Binding I/O
Date Thu, 23 Jun 2011 12:49:57 GMT
== A Design for a Persistent Bindings Mini-language

There are a number of activities that require being able to serialize, 
and read back, bindings.  They use different serializations.  A shared 
"bindings I/O" would mean all activities could use one, tuned, set of 
serialization and I/O classes.

JENA-44 (External sort) encodes a binding as a length-denoted byte 
array.  The byte array uses length-denoted byte arrays within the 
binding.  I/O is done using Data(In|Out)putStream and ByteBuffer, 
specifically putInt/getInt() and put/get(byte[]), for the per-row 
serialization as (var, Turtle string form) pairs.  It uses a null for 
no such value.
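
As an illustration only, here is a minimal Python sketch of a length-denoted 
encoding of this kind.  It is not the JENA-44 code: the function names, the 
big-endian 4-byte lengths, UTF-8, and the use of length -1 to stand for the 
null are all assumptions made for the example.

```python
import struct

def encode_binding(binding):
    """Encode {var: term-or-None} as one length-prefixed byte array."""
    body = bytearray()
    for var, term in binding.items():
        for text in (var, term):
            if text is None:
                # Assumption: a length of -1 marks "no such value".
                body += struct.pack(">i", -1)
            else:
                raw = text.encode("utf-8")
                body += struct.pack(">i", len(raw)) + raw
    # The outer length lets a reader skip or frame a whole row.
    return struct.pack(">i", len(body)) + bytes(body)

def decode_binding(data, offset=0):
    """Decode one binding at offset; return (binding, next_offset)."""
    (total,) = struct.unpack_from(">i", data, offset)
    pos = offset + 4
    end = pos + total
    fields = []
    while pos < end:
        (n,) = struct.unpack_from(">i", data, pos)
        pos += 4
        if n < 0:
            fields.append(None)
        else:
            fields.append(data[pos:pos + n].decode("utf-8"))
            pos += n
    # Fields alternate: var, term, var, term, ...
    return dict(zip(fields[0::2], fields[1::2])), end
```

Every row repeats every variable name and every full IRI, which is the 
overhead the proposal below tries to remove.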

JENA-45 (Spill to disk SPARQL Update) uses a more textual representation 
based on a binding encoded as (var, Turtle term) pairs.  End of row is 
denoted by a DOT.  It uses modified RIOT for input reading.

There is also use of TSV I/O for writing and reading result sets.  In 
this form, the variables are written once at the start, and not in each 
row.

== Proposed mini-language

This proposal takes those separate designs, and adds high-level compression.

A sequence of bindings is written assuming there is a list of variables 
in force.  Position in the row determines which variable is bound to 
which term (=> compression of variable names).  Turtle-style 
prefixes can be used (=> compression for IRIs) and the value of a slot 
in a row can be "same as the row before" (=> compression for repeated 
terms) or undefined.

Rows end in a DOT - this is not strictly necessary but adds robustness 
against truncated data and bugs.
Every row has the same length, in number of terms, as the list of 
variables in force.

Directives are lines starting with a keyword and ending in a DOT.

The directives are:

   PREFIX : <http://example> .

   Like Turtle, except keyword-based to fit with being a keyword-driven 
   language.

   VARS ?x ?y .

   Set the variables in force for subsequent rows,
   until the next VARS directive.
   We need VARS because it's not always possible to determine all
   the possible variables before starting to write out bindings.

A binding row is a sequence of terms, encoded like Turtle, including 
prefixed names and short forms for numbers (more compression).  In 
addition STAR ("*") means "same term as the row before" and DASH ("-") 
means undef.  Don't use * to repeat a - from the previous row; write - 
again.

Rows end in DOT. Preferred style is one space after each term.  This 
makes writing safe.

Terms can be written without intermediate copies (except local name 
processing) or buffers.  The OutputLangUtils does not do this currently 
but it should.
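
The row-writing rules can be sketched as follows.  This is a Python 
illustration, not Jena code: write_rows and rows_to_text are hypothetical 
names, terms are assumed to arrive already encoded as Turtle strings, and 
PREFIX directives are left out to keep it short.

```python
import io

def write_rows(variables, rows, out):
    """Write rows (dicts of var -> Turtle term string) to a text stream."""
    out.write("VARS " + " ".join("?" + v for v in variables) + " .\n")
    previous = None
    for row in rows:
        for var in variables:
            term = row.get(var)
            if term is None:
                # Undef is always written as DASH, never as STAR.
                out.write("- ")
            elif previous is not None and previous.get(var) == term:
                # Same term as the row before.
                out.write("* ")
            else:
                out.write(term + " ")
        # One space after each term, then the closing DOT.
        out.write(".\n")
        previous = row

def rows_to_text(variables, rows):
    """Convenience wrapper returning the output as a string."""
    buf = io.StringIO()
    write_rows(variables, rows, buf)
    return buf.getvalue()
```

Each term goes straight to the stream followed by a single space, so the 
writer needs no per-row buffer and no lookahead.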

For presentation reasons only, blank lines are allowed (this would all 
get lost in the lexing/tokenization anyway).


VARS ?x ?y .
PREFIX : <http://example/> .
:local1 <http://example.other/text> .
* - .
* 123 .
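
To show how the example above decodes, here is a small Python reader.  It 
is a sketch under loud assumptions: read_rows is a hypothetical name, one 
directive or row per line, terms contain no whitespace (so no real Turtle 
tokenizer is needed), prefixed names are returned unexpanded, and a VARS 
directive resets the "row before".

```python
def read_rows(text):
    """Parse mini-language text into (prefixes, list of {var: term})."""
    prefixes = {}
    variables = []
    previous = {}
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # blank lines are allowed
        if tokens[-1] != ".":
            raise ValueError("missing DOT (truncated input?): " + line)
        tokens = tokens[:-1]
        if tokens and tokens[0] == "PREFIX":
            prefixes[tokens[1]] = tokens[2]
        elif tokens and tokens[0] == "VARS":
            variables = [t.lstrip("?") for t in tokens[1:]]
            previous = {}  # assumption: STAR does not reach across VARS
        else:
            if len(tokens) != len(variables):
                raise ValueError("row length does not match VARS: " + line)
            row = {}
            for var, tok in zip(variables, tokens):
                if tok == "-":
                    continue  # undef: the variable is simply not bound
                if tok == "*":
                    if var not in previous:
                        raise ValueError("STAR with no previous term: " + line)
                    tok = previous[var]
                row[var] = tok
            rows.append(row)
            previous = row
    return prefixes, rows
```

For the example, this gives three bindings: ?x = :local1 with 
?y = <http://example.other/text>, then ?x = :local1 alone, then 
?x = :local1 with ?y = 123.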

== Discussion

The format is text - but we're writing strings anyway, so a binary form, 
rather than a delimited text form, is unlikely to give much advantage, 
and it can't reuse the standard bytes<->chars stuff without intermediate 
copies.

This would all be hidden behind an interface anyway.  A binary tokenizer 
and binary OutputLangUtils would enable binary output.

Dynamic choosing of prefixes can be done.
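
One possible way to do this, sketched in Python.  Everything here is an 
assumption for illustration: the PrefixChooser name, splitting the IRI at 
the last "/" or "#", the generated prefix names p0, p1, ..., and the 
deliberately conservative test for which local names are safe to abbreviate.

```python
import re

# Assumption: only simple local names are abbreviated; anything else
# is written as a full IRI.
_SAFE_LOCAL = re.compile(r"[A-Za-z][A-Za-z0-9_]*\Z")

class PrefixChooser:
    """Hand out short prefixes for namespaces as they are first seen."""

    def __init__(self):
        self.by_namespace = {}

    def abbreviate(self, iri):
        """Return (directive_or_None, term) for an IRI given without <>."""
        cut = max(iri.rfind("/"), iri.rfind("#")) + 1
        namespace, local = iri[:cut], iri[cut:]
        if cut == 0 or not _SAFE_LOCAL.match(local):
            return None, "<" + iri + ">"
        directive = None
        prefix = self.by_namespace.get(namespace)
        if prefix is None:
            # First use of this namespace: emit a PREFIX directive
            # before the row that needs it.
            prefix = "p%d" % len(self.by_namespace)
            self.by_namespace[namespace] = prefix
            directive = "PREFIX %s: <%s> ." % (prefix, namespace)
        return directive, prefix + ":" + local
```

A writer would output the returned directive, if any, on its own line 
before the row containing the term.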
