incubator-jena-dev mailing list archives

From Andy Seaborne <>
Subject Re: TDB Literal Canonicalization
Date Fri, 12 Aug 2011 10:13:17 GMT
The reply to Ian is the current state.

It could be changed - take a more value-oriented approach throughout.

(longer term thinking out loud, not plans, nor likely next steps).

1/ RIOT parsers could canonicalize data.

This is a possible approach to simple literals/xsd:strings for RDF 1.1.

We could canonicalize to xsd:decimal, or canonicalize integer-valued 
decimals to integer.


XSD 1.0 -> XSD 1.1 changes the canonical lexical form of integer-valued 
decimals from 78.0 to 78.

Potential parsing costs [*]
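
A minimal sketch of the second policy in 1/ (integer-valued decimals become integers), written in Python rather than as RIOT code. The function name `canonicalize` and the regular expressions are assumptions for illustration, not part of Jena; ill-formed lexical forms are passed through unchanged.

```python
import re
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

# Lexical spaces of xsd:integer and xsd:decimal (no exponent notation).
INTEGER_RE = re.compile(r"[+-]?[0-9]+")
DECIMAL_RE = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def canonicalize(lexical, datatype):
    """Return a (lexical, datatype) pair under a hypothetical policy:
    integer-valued decimals become xsd:integer, other decimals lose
    redundant zeros, and everything else is returned unchanged."""
    if datatype == XSD + "integer":
        pattern = INTEGER_RE
    elif datatype == XSD + "decimal":
        pattern = DECIMAL_RE
    else:
        return lexical, datatype          # not a type this sketch handles
    if not pattern.fullmatch(lexical):
        return lexical, datatype          # ill-formed: leave untouched
    value = Decimal(lexical)
    if value == value.to_integral_value():
        return str(int(value)), XSD + "integer"
    # Non-integer decimal: plain notation without trailing zeros.
    return format(value.normalize(), "f"), XSD + "decimal"
```

With this policy, "47.0"^^xsd:decimal and "+047"^^xsd:integer are both stored as "47"^^xsd:integer, so a term-level match finds them. The extra pattern check and copy per literal is the kind of per-triple cost discussed in [*].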

2/ ARQ/TDB query execution could specially handle XSD values to look for 
both lexical forms of a value:

{ ?x :p 123 . }   => { { ?x :p 123 . } union { ?x :p 123.0 . } }
{ ?x :p 123.0 . } => { { ?x :p 123 . } union { ?x :p 123.0 . } }

It's rather easier for constants.
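
The rewrite for constants can be sketched as plain string generation. This is an illustration only, not how ARQ transforms its algebra; `numeric_variants` and `expand_pattern` are hypothetical names, the constant is assumed to be a SPARQL integer or decimal shorthand, and only the two spellings shown above (123 and 123.0) are generated, so forms such as 123.00 are not covered.

```python
from decimal import Decimal


def numeric_variants(constant):
    """Return the integer and decimal spellings of an integer-valued
    constant, or just the constant itself when it has a fractional part."""
    value = Decimal(constant)
    if value != value.to_integral_value():
        return [constant]
    n = int(value)
    return [str(n), f"{n}.0"]


def expand_pattern(subject, predicate, constant):
    """Build a group pattern that matches either spelling of the constant."""
    groups = [f"{{ {subject} {predicate} {v} . }}"
              for v in numeric_variants(constant)]
    if len(groups) == 1:
        return groups[0]
    return "{ " + " UNION ".join(groups) + " }"


print(expand_pattern("?x", ":p", "123"))
# prints: { { ?x :p 123 . } UNION { ?x :p 123.0 . } }
```

Both `123` and `123.0` expand to the same union, which is the symmetry shown in the two rewrite rules.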

{ ?x :p1 ?v ; :p2 ?v . } and doing value equality is doable, quite 
easily with an index join, but I'd need to think more about merge joins 
(not currently used anyway).
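
A rough sketch of value equality in a join, assuming an in-memory dictionary stands in for a TDB index and that all objects are numeric literals given by their lexical forms. The names `value_key` and `index_join` are hypothetical. The key point is that the lookup key is the value, not the term.

```python
from collections import defaultdict
from decimal import Decimal


def value_key(lexical):
    """Map a numeric lexical form to its value; Decimal("47") and
    Decimal("47.0") compare equal and hash equal."""
    return Decimal(lexical)


def index_join(p1_rows, p2_rows):
    """Join (subject, lexical) rows for :p1 and :p2 on the same subject
    and the same numeric value, for { ?x :p1 ?v ; :p2 ?v . }."""
    index = defaultdict(list)
    for subject, lexical in p2_rows:
        index[(subject, value_key(lexical))].append(lexical)
    for subject, lexical in p1_rows:
        # Probe the index by (subject, value) instead of by exact term.
        for other in index.get((subject, value_key(lexical)), []):
            yield subject, lexical, other
```

For example, joining `[("a", "47")]` with `[("a", "47.0"), ("a", "48")]` yields the single row `("a", "47", "47.0")`, which a term-equality join would miss.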

Any and all random thoughts and comments welcome - I guess the real 
issue is to decide a policy for Jena.

How much to work in terms of "value" and how much to work preserving the 
representational differences.  For example, this can change COUNT() results.
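
A small illustration of the COUNT() point, using made-up data: three objects of one subject, two of which are the same value written differently. The function names are hypothetical.

```python
from decimal import Decimal

# Lexical forms of the objects of a single subject (illustrative data).
objects = ["47", "47.0", "48"]


def count_distinct_terms(lexicals):
    """Count distinct RDF terms: "47" and "47.0" stay separate."""
    return len(set(lexicals))


def count_distinct_values(lexicals):
    """Count distinct numeric values: "47" and "47.0" collapse."""
    return len({Decimal(x) for x in lexicals})


print(count_distinct_terms(objects))   # prints 3
print(count_distinct_values(objects))  # prints 2
```

Which of the two answers a query should report is exactly the policy question.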


[*] On N-triples loading:

When loading at scale, this is a possibly appreciable cost.  The 
N-triples load path is already fairly streamlined and an extra step of 
check-copy may be a visible cost.  N-triples parsing is not strongly I/O 
bound - it reads large chunks in a streaming fashion and files tend to be 
generated all at once, causing the disk blocks to be laid out nicely.

Costs may be offset by some concurrent processing - I did do one simple 
experiment and found that concurrent processing was faster, so the costs 
of concurrency were smaller than the gains from using more threads.

On 12/08/11 10:03, Andy Seaborne wrote:
> On 11/08/11 22:41, Ian Emmons wrote:
>> TDB experts,
>> At [1], the TDB documentation indicates that TDB will regard
>> "47"^^xsd:integer and "47.0"^^xsd:decimal as the same value and match
>> them in a query. However, when I store the former and query for the
>> latter, TDB does not return the expected result.
> TDB stores the values of integer and decimal, but it does still keep
> those two types apart. The rules of XSD arithmetic try not to
> over-promote datatypes, e.g. integer + integer is integer.
> I guess by "query" you mean putting the decimal directly in a graph
> pattern. They are the same value in FILTERs.
>> I've attached a small sample program and the .ttl file that it reads
>> so that you can reproduce the problem. My question is, what am I
>> doing wrong, here?
> The attachments are empty - and indeed the [1] link is in the second
> attachment. I can send you the raw source of the message I received if
> that helps.
> Andy
>> Thanks,
>> Ian
>> [1]
