From: "Andy Seaborne (JIRA)"
To: jena-dev@incubator.apache.org
Reply-To: jena-dev@incubator.apache.org
Date: Tue, 9 Aug 2011 15:48:27 +0000 (UTC)
Subject: [jira] [Commented] (JENA-85) Common bindings I/O

[ 
https://issues.apache.org/jira/browse/JENA-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081709#comment-13081709 ]

Andy Seaborne commented on JENA-85:
-----------------------------------

Sink patch applied. write(Binding) is also available and is equivalent to send(Binding), for familiarity of naming.

For the bNodes, a complication is that just writing "_:bnodelabel" isn't legal. The tokenizer needs a reversible bNode mapping. One approach is to enable the <_:label> syntax for both input and output.

> Common bindings I/O
> -------------------
>
>                 Key: JENA-85
>                 URL: https://issues.apache.org/jira/browse/JENA-85
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Paolo Castagna
>         Attachments: JENA-85-BindingOutputStream-Changes.patch, JENA-85-Blank-Node-Test.patch
>
>
> ( from: http://markmail.org/thread/ljjrsiun3oxtrchw )
>
> There are a number of activities that require being able to serialize, and read back, bindings. They use different serializations. A shared "bindings I/O" would mean all activities could use one, tuned, set of serialization and I/O classes.
>
> JENA-44 (External sort) encodes a binding as a length-denoted byte array. The byte array uses length-denoted byte arrays within the bindings. I/O is done using Data(In|Out)putStream, specifically putInt/getInt() and put/get(byte[]), and ByteBuffer putInt/getInt() and put/get(byte[]), for the per-row serialization as (var, Turtle string form) pairs. It uses a null for no such value.
>
> JENA-45 (Spill to disk SPARQL Update) uses a more textual representation based on a binding encoded as (var, Turtle term) pairs. End of row is denoted by a DOT. It uses modified RIOT for input reading.
>
> There is also use of TSV I/O for writing and reading result sets. In this form, the variables are written once at the start, and not in each line.
>
> == Proposed mini-language
>
> This proposal takes those separate designs, and adds high-level compression.
> A sequence of bindings is written assuming there is a list of variables in force. Position in the row determines which term is bound to which variable (=> compression of variable names). Turtle-style prefixes can be used (=> compression for IRIs) and the value of a slot in a row can be "same as the row before" (=> compression for repeated terms) or undefined.
>
> Rows end in a DOT - this is not strictly necessary but adds robustness against truncated data and bugs.
>
> Every row is the same length, in number of terms, as the list of variables in force.
>
> Directives are lines starting with a keyword. They end in a DOT.
>
> The directives are:
>
> PREFIX : .
>   Like Turtle, except keyword based to fit with being a keyword-driven mini-language.
>
> VARS ?x ?y .
>   Set the variables in force for subsequent rows,
>   until the next VARS directive.
>   We need VARS because it's not always possible to determine all
>   the possible variables before starting to write out bindings.
>
> A binding row is a sequence of terms, encoded like Turtle, including prefixed names and short forms for numbers (more compression). In addition, STAR ("*") means "same term as the row before" and DASH ("-") means undef. Don't use * to repeat a - from the previous row.
>
> Rows end in DOT. Preferred style is one space after each term. This makes writing safe.
>
> Terms can be written without intermediate copies (except local name processing) or buffers. OutputLangUtils does not do this currently but it should.
>
> For presentation reasons only, blank lines are allowed (this would all get lost in the lexing/tokenization anyway).
>
> Example:
> -------------
> VARS ?x ?y .
> PREFIX : .
> :local1 .
> * - .
> * 123 .
> -------------
>
> == Discussion
>
> The format is text - but we're writing strings anyway, so a binary form, rather than a delimited text form, is unlikely to give much advantage, and it can't reuse the standard bytes<->chars stuff without intermediate copies.
>
> This would all be hidden behind an interface anyway.
> A binary tokenizer and binary OutputLangUtils would enable binary output.
>
> Dynamic choosing of prefixes can be done.
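
To make the row rules in the quoted proposal concrete, here is a minimal sketch of a writer for the mini-language. It is not ARQ's BindingOutputStream: the class name BindingRowWriter and its methods are hypothetical, terms are assumed to be passed in already in Turtle string form, a Java null stands for an undefined slot, and PREFIX handling is left out. Resetting the "row before" state on each VARS directive is also an assumption; the proposal does not say what STAR means across a change of variables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the proposed row format; not part of ARQ. */
class BindingRowWriter {
    private final StringBuilder out = new StringBuilder();
    private List<String> vars = new ArrayList<>();
    private String[] previous = null;   // last row written under the current VARS

    /** Emit a VARS directive; assumption: this also resets the "row before". */
    public void vars(String... names) {
        vars = Arrays.asList(names);
        previous = null;
        out.append("VARS");
        for (String v : names) out.append(" ?").append(v);
        out.append(" .\n");
    }

    /** Write one row. Terms are already in Turtle form; null means undefined. */
    public void write(String... terms) {
        if (terms.length != vars.size())
            throw new IllegalArgumentException("row length must match VARS");
        for (int i = 0; i < terms.length; i++) {
            String t = terms[i];
            if (t == null) {
                out.append("-");        // DASH: undefined, never abbreviated to STAR
            } else if (previous != null && t.equals(previous[i])) {
                out.append("*");        // STAR: same term as the row before
            } else {
                out.append(t);
            }
            out.append(' ');            // one space after each term
        }
        out.append(".\n");              // rows end in DOT
        previous = terms.clone();
    }

    @Override
    public String toString() { return out.toString(); }
}
```

Calling vars("x", "y") and then writing the rows (":local1", "<http://example/a>"), (":local1", null) and (":local1", "123") produces a VARS line followed by ":local1 <http://example/a> .", "* - ." and "* 123 .". The IRI in the first row is a made-up value for illustration only.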