incubator-jena-dev mailing list archives

From "Andy Seaborne (Commented) (JIRA)" <>
Subject [jira] [Commented] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction
Date Fri, 23 Mar 2012 11:49:27 GMT


Andy Seaborne commented on JENA-225:

The patch should stop DB crashes, but I'm not sure what's going to happen if the lexical form
is mangled by the Java decoder and it substitutes a "?".  That changes both its Java hash and its
MD5 hash, so it might lead to inconsistency.
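A minimal standalone sketch (plain Java, not TDB code) of the mangling described above: Java's standard UTF-8 encoder cannot represent the unpaired surrogate from the reporter's example, so on encode it substitutes the replacement byte '?' (0x3F), which changes the string and therefore its hash:

```java
import java.nio.charset.StandardCharsets;

public class SurrogateMangling {
    public static void main(String[] args) {
        // The reporter's input: an unpaired surrogate in the lexical form.
        String original = "Hello \uDAE0 World";

        // chars -> bytes: the lone surrogate is replaced with '?' (0x3F).
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        // bytes -> chars: we get back a different string.
        String decoded = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(decoded);                                   // Hello ? World
        System.out.println(original.equals(decoded));                  // false
        System.out.println(original.hashCode() == decoded.hashCode()); // false
    }
}
```

If the mangled form is what gets hashed and stored, lookups keyed on the original lexical form will no longer match, which is the inconsistency worried about above.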

So the proper fix is to use a codec that is binary-robust.  BlockUTF8 has had tests added
and is now aligned with the way Java handles codepoint 0 (legal in Unicode; Java encodes it
as (char)0, while modified UTF-8 uses the byte pair 0xC0 0x80).  However, TDB only needs the cycle chars->bytes->chars
to work, and this variance only affects the bytes->chars->bytes round trip.
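A small illustration (standalone Java, assumptions noted in comments) of the two encodings of codepoint 0, and of the chars->bytes->chars cycle that TDB actually depends on:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class CodepointZero {
    public static void main(String[] args) throws IOException {
        String s = "\u0000";

        // Standard UTF-8: codepoint 0 is the single byte 0x00.
        byte[] std = s.getBytes(StandardCharsets.UTF_8);
        System.out.printf("standard UTF-8: %02X%n", std[0]);  // 00

        // Modified UTF-8 (as used by DataOutputStream.writeUTF):
        // codepoint 0 becomes the pair 0xC0 0x80, after a 2-byte length prefix.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        new DataOutputStream(out).writeUTF(s);
        byte[] mod = out.toByteArray();
        System.out.printf("modified UTF-8: %02X %02X%n",
                mod[2] & 0xFF, mod[3] & 0xFF);                // C0 80

        // The chars -> bytes -> chars cycle still round-trips cleanly.
        System.out.println(s.equals(new String(std, StandardCharsets.UTF_8))); // true
    }
}
```

The variance only shows up when decoding arbitrary bytes and re-encoding them (bytes->chars->bytes), which is not the direction TDB's node table relies on.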
> TDB datasets can be corrupted by performing certain operations within a transaction 
> ------------------------------------------------------------------------------------
>                 Key: JENA-225
>                 URL:
>             Project: Apache Jena
>          Issue Type: Bug
>    Affects Versions: TDB 0.9.0
>         Environment: jena-tdb-0.9.0-incubating
>            Reporter: Sam Tunnicliffe
>         Attachments: JENA-225-v1.patch,
> In a web application, we read some triples in an HTTP POST, using a LangTurtle instance
and a tokenizer obtained from TokenizerFactory.makeTokenizerUTF8. 
> We then write the parsed Triples back out (to temporary storage) using OutputLangUtils.write.
At some later time, these Triples are then re-read, again using a Tokenizer from TokenizerFactory.makeTokenizerUTF8,
before being inserted into a TDB dataset. 
> We have found it possible for the input data to contain character strings which pass
through the various parsers/serializers but which cause TDB's transaction layer to error in
such a way as to make recovery from journals ineffective. 
> Eliminating transactions from the code path enables the database to be updated successfully.
> The stacktrace from TDB looks like this: 
> org.openjena.riot.RiotParseException: [line: 1, col: 2 ] Broken token: Hello 
> 	at org.openjena.riot.tokens.TokenizerText.exception(
> 	at org.openjena.riot.tokens.TokenizerText.readString(
> 	at org.openjena.riot.tokens.TokenizerText.parseToken(
> 	at org.openjena.riot.tokens.TokenizerText.hasNext(
> 	at com.hp.hpl.jena.tdb.nodetable.NodecSSE.decode(
> 	at com.hp.hpl.jena.tdb.lib.NodeLib.decode(
> 	at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(
> 	at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(
> 	at org.openjena.atlas.iterator.Iter$
> 	at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.append(
> 	at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.writeNodeJournal(
> 	at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.commitPrepare(
> 	at com.hp.hpl.jena.tdb.transaction.Transaction.prepare(
> 	at com.hp.hpl.jena.tdb.transaction.Transaction.commit(
> 	at com.hp.hpl.jena.tdb.transaction.DatasetGraphTxn.commit(
> 	at com.hp.hpl.jena.tdb.transaction.DatasetGraphTransaction._commit(
> 	at com.hp.hpl.jena.tdb.migrate.DatasetGraphTrackActive.commit(
> 	at com.hp.hpl.jena.sparql.core.DatasetImpl.commit(
> At least part of the issue seems to stem from NodecSSE (I know this isn't actual Unicode
escaping, but it's derived from the user input we've received). 
> String s = "Hello \uDAE0 World";
> Node literal = Node.createLiteral(s);
> ByteBuffer bb = NodeLib.encode(literal);
> NodeLib.decode(bb);

This message is automatically generated by JIRA.

