Date: Wed, 21 Mar 2012 19:58:41 +0000 (UTC)
From: "Andy Seaborne (Updated) (JIRA)"
To: jena-dev@incubator.apache.org
Reply-To: jena-dev@incubator.apache.org
Message-ID: <1940268217.43901.1332359921269.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1461420273.42964.1332352781011.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (JENA-225) TDB datasets can be corrupted by performing certain operations within a transaction
Mailing-List: contact jena-dev-help@incubator.apache.org; run by ezmlm

     [ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne updated JENA-225:
-------------------------------

    Attachment: JENA-225-v1.patch

Potential fix that sets the charset encoder/decoder to replace bad codepoints with the default replacement char (a '?').

Caveat: String data does not round trip, hashing and equality change, caches may be affected (needs checking).

> TDB datasets can be corrupted by performing certain operations within a transaction
> ------------------------------------------------------------------------------------
>
>                 Key: JENA-225
>                 URL: https://issues.apache.org/jira/browse/JENA-225
>             Project: Apache Jena
>          Issue Type: Bug
>    Affects Versions: TDB 0.9.0
>         Environment: jena-tdb-0.9.0-incubating
>            Reporter: Sam Tunnicliffe
>         Attachments: JENA-225-v1.patch, ReportBadUnicode1.java
>
>
> In a web application, we read some triples in an HTTP POST, using a LangTurtle instance and a tokenizer obtained from TokenizerFactory.makeTokenizerUTF8.
> We then write the parsed Triples back out (to temporary storage) using OutputLangUtils.write. At some later time, these Triples are re-read, again using a Tokenizer from TokenizerFactory.makeTokenizerUTF8, before being inserted into a TDB dataset.
> We have found it possible for the input data to contain character strings which pass through the various parsers/serializers but which cause TDB's transaction layer to fail in a way that makes recovery from the journals ineffective.
> Eliminating transactions from the code path enables the database to be updated successfully.
> The stacktrace from TDB looks like this:
> org.openjena.riot.RiotParseException: [line: 1, col: 2 ] Broken token: Hello
>     at org.openjena.riot.tokens.TokenizerText.exception(TokenizerText.java:1209)
>     at org.openjena.riot.tokens.TokenizerText.readString(TokenizerText.java:620)
>     at org.openjena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:248)
>     at org.openjena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:112)
>     at com.hp.hpl.jena.tdb.nodetable.NodecSSE.decode(NodecSSE.java:105)
>     at com.hp.hpl.jena.tdb.lib.NodeLib.decode(NodeLib.java:93)
>     at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(NodeTableNative.java:234)
>     at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(NodeTableNative.java:228)
>     at org.openjena.atlas.iterator.Iter$4.next(Iter.java:301)
>     at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.append(NodeTableTrans.java:188)
>     at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.writeNodeJournal(NodeTableTrans.java:306)
>     at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.commitPrepare(NodeTableTrans.java:266)
>     at com.hp.hpl.jena.tdb.transaction.Transaction.prepare(Transaction.java:131)
>     at com.hp.hpl.jena.tdb.transaction.Transaction.commit(Transaction.java:112)
>     at com.hp.hpl.jena.tdb.transaction.DatasetGraphTxn.commit(DatasetGraphTxn.java:40)
>     at com.hp.hpl.jena.tdb.transaction.DatasetGraphTransaction._commit(DatasetGraphTransaction.java:106)
>     at com.hp.hpl.jena.tdb.migrate.DatasetGraphTrackActive.commit(DatasetGraphTrackActive.java:60)
>     at com.hp.hpl.jena.sparql.core.DatasetImpl.commit(DatasetImpl.java:143)
> At least part of the issue seems to stem from NodecSSE (I know this isn't actual Unicode escaping, but it's derived from the user input we've received).
>     String s = "Hello \uDAE0 World";            // \uDAE0 is a lone high surrogate
>     Node literal = Node.createLiteral(s);
>     ByteBuffer bb = NodeLib.encode(literal);
>     NodeLib.decode(bb);                         // fails in NodecSSE.decode, as in the trace above

--
This message is automatically generated by JIRA.
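
The charset behaviour behind both the reported failure and the proposed fix can be reproduced with the JDK alone, without Jena on the classpath. The following is a minimal sketch, not the contents of JENA-225-v1.patch: the class and method names (LoneSurrogateDemo, encodesStrictly, roundTripWithReplacement) are made up for illustration, and it assumes Java 7 or later for StandardCharsets. It shows that a strict UTF-8 encoder rejects the lone surrogate \uDAE0, and that switching the encoder to CodingErrorAction.REPLACE lets the string through but changes it, which is the round-trip caveat mentioned with the patch.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Jena or of the attached patch.
final class LoneSurrogateDemo {

    // Returns true if the string can be encoded as UTF-8 when malformed
    // input (such as an unpaired surrogate) is reported as an error.
    static boolean encodesStrictly(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            enc.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Encodes with bad input replaced by the encoder's replacement bytes,
    // then decodes the result back to a String.
    static String roundTripWithReplacement(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            ByteBuffer bytes = enc.encode(CharBuffer.wrap(s));
            return StandardCharsets.UTF_8.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String s = "Hello \uDAE0 World";
        String back = roundTripWithReplacement(s);
        System.out.println("strict encode ok: " + encodesStrictly(s));
        System.out.println("after round trip: " + back);
        System.out.println("equal to input:   " + s.equals(back));
    }
}
```

Because the replaced string is no longer equal to the original, anything keyed on the original lexical form (hashes, node caches) would see a different value after a store and reload, so replacement trades a hard failure for silent data change.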