incubator-jena-dev mailing list archives

From "Stephen Allen (JIRA)" <>
Subject [jira] [Updated] (JENA-85) Common bindings I/O
Date Tue, 09 Aug 2011 00:38:28 GMT


Stephen Allen updated JENA-85:

    Attachment: JENA-85-Blank-Node-Test.patch

I'm having some issues with blank nodes not coming back with the same internal label.  I've
attached a test case that fails when writing a binding with a blank node and then reading it
back in [1].

The issue seems to be that the serializer writes a mapped blank node label (like _:b0) instead
of the internal label.

[1] JENA-85-Blank-Node-Test.patch

> Common bindings I/O
> -------------------
>                 Key: JENA-85
>                 URL:
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Paolo Castagna
>         Attachments: JENA-85-BindingOutputStream-Changes.patch, JENA-85-Blank-Node-Test.patch
> ( from: )
> There are a number of activities that require being able to serialize, and read back,
bindings.  They use different serializations.  A shared "bindings I/O" would mean all activities
could use one, tuned, set of serialization and I/O classes.
> JENA-44 (External sort) encodes a binding as a length-denoted byte array.  The byte array
uses length-denoted byte arrays within the bindings.  I/O is done using Data(In|Out)putStream,
specifically putInt/getInt() and put/get(byte[]), and ByteBuffer putInt/getInt() and put/get(byte[])
for the per-row serialization as (var,Turtle string form) pairs.  It uses a null for no such
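A minimal sketch of the length-denoted layout described above, assuming UTF-8 strings and a leading pair count. The class name `LengthDenotedRow` and the exact byte layout are illustrative assumptions, not the JENA-44 code. Note that `java.io.DataOutputStream` spells the integer calls `writeInt`/`readInt`; `putInt`/`getInt` are the `ByteBuffer` names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of ARQ: a row is a pair count followed by
// length-denoted UTF-8 byte arrays for each variable name and Turtle term string.
final class LengthDenotedRow {

    static byte[] encode(Map<String, String> row) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(row.size());
            for (Map.Entry<String, String> e : row.entrySet()) {
                writeBytes(out, e.getKey());    // variable name
                writeBytes(out, e.getValue());  // term in Turtle string form
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    static Map<String, String> decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int pairs = in.readInt();
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < pairs; i++) {
                String var = readBytes(in);
                String term = readBytes(in);
                row.put(var, term);
            }
            return row;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    private static void writeBytes(DataOutputStream out, String s) throws IOException {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(b.length);  // length first ...
        out.write(b);            // ... then the bytes themselves
    }

    private static String readBytes(DataInputStream in) throws IOException {
        byte[] b = new byte[in.readInt()];
        in.readFully(b);
        return new String(b, StandardCharsets.UTF_8);
    }
}
```

Every string has to be turned into a byte array before its length can be written, which is the kind of intermediate copy the proposal below tries to avoid.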
> JENA-45 (Spill to disk SPARQL Update) uses a more textual representation based on a binding
encoded as (var, Turtle term). End of row is denoted by a DOT.  It uses modified RIOT for
input reading.
> There is also use of TSV I/O for writing and reading result sets.  In this form, the
variables are written once at the start, and not in each line.
> == Proposed mini-language
> This proposal takes those separate designs, and adds high-level compression.
> A sequence of bindings is written assuming there is a list of variables in force.  Position
in the row determines which term is bound to which variable (=> compression of variable
names).  Turtle-style prefixes can be used (=> compression for IRIs) and the value of a
slot in a row can be "same as the row before" (=> compression for repeated terms) or undefined.
> Rows end in a DOT - this is not strictly necessary but adds robustness against truncated
data and bugs.
> Every row is the same length, in number of terms, as the list of variables in force.
> Directives are lines starting with a keyword and ending in a DOT.
> The directives are:
>   PREFIX : <http://example> .
>   Like Turtle, except keyword-based to fit with being a keyword-driven mini-language.
>   VARS ?x ?y .
>   Set the variables in force for subsequent rows,
>   until the next VARS directive.
>   We need VARS because it's not always possible to determine all
>   the possible variables before starting to write out bindings.
> A binding row is a sequence of terms, encoded like Turtle, including prefixed names and
short forms for numbers (more compression).  In addition STAR ("*") means "same term as the
row before" and DASH ("-") means undef.  Don't use * to repeat an undef (-) from the previous row.
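The STAR and DASH rules can be sketched as a small row encoder. This is an illustration only: `RowEncoder` is a hypothetical name, terms are assumed to be already in Turtle string form, `null` stands for an unbound variable, and both rows are assumed to have the length of the variables in force.

```java
import java.util.List;

// Hypothetical encoder for one row of the proposed format (not ARQ code).
final class RowEncoder {

    // previous may be null for the first row after a VARS directive.
    static String encodeRow(List<String> previous, List<String> current) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < current.size(); i++) {
            String term = current.get(i);
            String before = (previous == null) ? null : previous.get(i);
            if (term == null) {
                sb.append('-');            // DASH: undefined, never written as STAR
            } else if (term.equals(before)) {
                sb.append('*');            // STAR: same term as the row before
            } else {
                sb.append(term);
            }
            sb.append(' ');                // one space after each term
        }
        return sb.append('.').toString();  // rows end in DOT
    }
}
```

Because the undefined case is tested first, an unbound slot is always written as DASH, even when the slot was also unbound in the row before.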
> Rows end in DOT. Preferred style is one space after each term.  This makes writing safe.
> Terms can be written without intermediate copies (except local name processing) or buffers.
 The OutputLangUtils does not do this currently but it should.
> For presentation reasons only, blank lines are allowed (this would all get lost in the
lexing/tokenization anyway).
> Example:
> -------------
> VARS ?x ?y .
> PREFIX : <http://example/> .
> :local1 <http://example.other/text> .
> * - .
> * 123 .
> -------------
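To show how the example above would be read back, here is a rough reader sketch. It is not the RIOT-based tokenizer the proposal has in mind: `BindingsReader` is a hypothetical name, tokens are split on whitespace (so literals containing spaces are not handled), each directive or row is assumed to sit on one line, and a STAR in the first row after VARS is treated as undefined.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for the proposed format (illustration only).
final class BindingsReader {

    static List<Map<String, String>> read(String text) {
        List<Map<String, String>> rows = new ArrayList<>();
        Map<String, String> prefixes = new HashMap<>();
        List<String> vars = new ArrayList<>();
        String[] previous = new String[0];

        for (String line : text.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) continue;                      // blank lines are allowed
            String[] tok = line.split("\\s+");
            if (!tok[tok.length - 1].equals("."))
                throw new IllegalArgumentException("Missing DOT, possibly truncated: " + line);
            int n = tok.length - 1;                            // tokens before the DOT

            if (tok[0].equals("VARS")) {
                vars = new ArrayList<>();
                for (int i = 1; i < n; i++) vars.add(tok[i].substring(1));   // drop '?'
                previous = new String[vars.size()];
            } else if (tok[0].equals("PREFIX")) {
                String name = tok[1].substring(0, tok[1].length() - 1);      // drop ':'
                prefixes.put(name, tok[2].substring(1, tok[2].length() - 1)); // drop '<' and '>'
            } else {
                if (n != vars.size())
                    throw new IllegalArgumentException("Wrong row length: " + line);
                String[] terms = new String[n];
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < n; i++) {
                    if (tok[i].equals("*")) terms[i] = previous[i];          // same as row before
                    else if (tok[i].equals("-")) terms[i] = null;            // undefined
                    else terms[i] = expand(tok[i], prefixes);
                    if (terms[i] != null) row.put(vars.get(i), terms[i]);
                }
                rows.add(row);
                previous = terms;
            }
        }
        return rows;
    }

    // Expands a prefixed name to <IRI>; IRIs, literals, blank nodes and numbers pass through.
    private static String expand(String token, Map<String, String> prefixes) {
        int colon = token.indexOf(':');
        if (token.startsWith("<") || token.startsWith("\"") || token.startsWith("_:") || colon < 0)
            return token;
        String base = prefixes.get(token.substring(0, colon));
        return base == null ? token : "<" + base + token.substring(colon + 1) + ">";
    }
}
```

Fed the five lines of the example, `BindingsReader.read` yields three rows: the second binds only `x`, and the third binds `x` to the same IRI as the first row and `y` to `123`.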
> == Discussion
> The format is text, but we're writing strings anyway, so a binary form, rather than a
delimited text form, is unlikely to give much advantage, and it can't reuse the standard
bytes<->chars stuff without intermediate copies.
> This would all be hidden behind an interface anyway.  A binary tokenizer and binary OutputLangUtils
would enable binary output.
> Dynamic choosing of prefixes can be done. 
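One way dynamic prefix choice might work is sketched below, under several assumptions that are not in the proposal: a namespace is everything up to the last `/` or `#`, it earns a generated prefix (`p0`, `p1`, ...) once it has been seen `threshold` times, and only simple alphanumeric local names are abbreviated. `DynamicPrefixes` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: picks prefixes while writing, based on namespace usage.
final class DynamicPrefixes {
    private final int threshold;
    private final Map<String, Integer> uses = new HashMap<>();
    private final Map<String, String> prefixOf = new LinkedHashMap<>();
    private final StringBuilder pending = new StringBuilder();

    DynamicPrefixes(int threshold) { this.threshold = threshold; }

    // Returns the text to write for an IRI: a prefixed name or the full <IRI>.
    String abbreviate(String iri) {
        int split = Math.max(iri.lastIndexOf('/'), iri.lastIndexOf('#')) + 1;
        String local = iri.substring(split);
        if (split == 0 || !local.matches("[A-Za-z][A-Za-z0-9_]*"))
            return "<" + iri + ">";                       // not safely abbreviable
        String ns = iri.substring(0, split);
        int seen = uses.merge(ns, 1, Integer::sum);
        String prefix = prefixOf.get(ns);
        if (prefix == null) {
            if (seen < threshold) return "<" + iri + ">"; // not used often enough yet
            prefix = "p" + prefixOf.size();
            prefixOf.put(ns, prefix);
            pending.append("PREFIX ").append(prefix).append(": <").append(ns).append("> .\n");
        }
        return prefix + ":" + local;
    }

    // Directives that must be written before the row currently being built; clears the buffer.
    String pendingDirectives() {
        String s = pending.toString();
        pending.setLength(0);
        return s;
    }
}
```

A writer would call `abbreviate` for each IRI in a row, then write `pendingDirectives()` ahead of that row so that every prefix is declared before its first use.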

This message is automatically generated by JIRA.