From: "Alessandro Bologna"
To: users@jackrabbit.apache.org
Date: Mon, 31 Mar 2008 16:46:44 -0400
Subject: XML, SNS, and JCR

Hi all,

One of the most fascinating things about the JCR is that it always gives you more to think about. What follows is a very long message that tries to make the point that maybe we need another way to map XML to JCR and vice versa. Besides being long, it is probably boring, and probably naive in some parts, so read it only if the topic matters to you...

So, the story goes that after Jukka asked whether it would be worth dropping support for Same Name Siblings (SNS), and knowing well how useful SNS are in mapping XML documents into the JCR, I wondered if there was a piece missing from the puzzle: XML has no issue with SNS, and XPATH (1.0 and 2.0) are quite happy with them too. At the same time, David's modeling suggestion "Beware of Same Name Siblings" seems to contradict the usage experience of those who come from an XML background.

In other words, in XML it is pretty normal to have:

[XML example not preserved in the archive]

while it would be unusual to have something like:

[XML example not preserved in the archive]

It's possible of course, just not the usual way people design XML.
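To make the contrast concrete, here is a small Python sketch of the two shapes. The markup is a guess based on the people tree and the my:employee query that appear later in this message, not the original examples, and the namespace URI for the my: prefix is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the my: prefix (not given in the message).
MY_NS = "http://example.com/my"

# Usual XML style: the element name is the type, so siblings share a name.
TYPE_NAMED = f"""
<people xmlns:my="{MY_NS}">
  <my:employee>
    <my:name first="John" last="Smith"/>
    <my:dob value="10/01/1970"/>
  </my:employee>
  <my:employee>
    <my:name first="Mary" last="Smith"/>
    <my:dob value="11/07/1973"/>
  </my:employee>
</people>
"""

# Unusual XML style: the element name identifies the instance.
INSTANCE_NAMED = f"""
<people xmlns:my="{MY_NS}">
  <john.smith>
    <my:name first="John" last="Smith"/>
    <my:dob value="10/01/1970"/>
  </john.smith>
  <mary.smith>
    <my:name first="Mary" last="Smith"/>
    <my:dob value="11/07/1973"/>
  </mary.smith>
</people>
"""


def has_same_name_siblings(xml_text):
    """Return True if any two children of the root element share a tag."""
    tags = [child.tag for child in ET.fromstring(xml_text)]
    return len(tags) != len(set(tags))


print(has_same_name_siblings(TYPE_NAMED))      # True
print(has_same_name_siblings(INSTANCE_NAMED))  # False
```

Only the first shape produces same name siblings under people; the second avoids them at the price of element names that say nothing about the type.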
By the way, in the examples above I am using an attribute-centric model just for simplicity of comparison with JCR properties. The same considerations would apply if I were to use child elements (and jcr:xmltext child nodes with a jcr:xmlcharacters property), but what matters is that, in XML, *the element name is almost always mapped to the type, not the instance*.

In JCR modeling, this can lead to all the well-known issues with same name siblings, so the approach is instead more "file" centric, where each element (node) is given a unique identifier, *unless it's not needed*. For example, this should be OK in JCR:

people
|
+---john.smith
|   +---- my:name:
|   |     +--- -first: John
|   |     +--- -last: Smith
|   +---- my:dob:
|         +--- -value: 10/01/1970
+---mary.smith
    +---- my:name:
    |     +--- -first: Mary
    |     +--- -last: Smith
    +---- my:dob:
          +--- -value: 11/07/1973

In order to avoid SNS, the idea is to use a parent-unique id as the node name wherever a conflict would arise; this is not required for nodes that are logically already unique in their parent's context (for instance, my:name and my:dob). In this model the node name can always be made unique when needed, by adding the SSN, the DOB, or something else.

This means that the XML/XPATH query

/people/my:employee/my:name[@last='Smith']

would need to be rewritten in JCR/XPATH as

/jcr:root/people/*/my:name[@last='Smith']

because of course the node name is not known a priori, or (better) as

/jcr:root/people/element(*,nt:base)/my:name[@last='Smith']

The second notation uses the XPATH 2.0 element() function, which selects nodes of a specific type (or of a type that inherits from it). In XML it uses the schema element name; in JCR, the node type.
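The effect of that rewrite can be tried outside a repository. The sketch below uses Python's ElementTree as a stand-in for the virtual document; it is not a JCR query engine and does not support element(*, type), so only the first two forms are shown. The my: namespace URI and the ann.jones entry are made up for the illustration.

```python
import xml.etree.ElementTree as ET

NS = {"my": "http://example.com/my"}  # hypothetical URI for the my: prefix

# Stand-in for the document view of the people tree:
# node names (john.smith, ...) become element names.
DOCUMENT_VIEW = """
<people xmlns:my="http://example.com/my">
  <john.smith><my:name first="John" last="Smith"/></john.smith>
  <mary.smith><my:name first="Mary" last="Smith"/></mary.smith>
  <ann.jones><my:name first="Ann" last="Jones"/></ann.jones>
</people>
"""

people = ET.fromstring(DOCUMENT_VIEW)

# XML-style path: finds nothing, because no child is named my:employee.
by_type_name = people.findall("my:employee/my:name[@last='Smith']", NS)

# JCR-style path: a wildcard step, since node names are not known in advance.
by_wildcard = people.findall("*/my:name[@last='Smith']", NS)

print(len(by_type_name))
print([name.get("first") for name in by_wildcard])
```

The wildcard matches every child of people regardless of what it represents, which is why a type test becomes attractive as soon as different kinds of nodes share a parent.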
If we were using custom node types, and let's assume that we do from now on, then the JCR query above could have been written more specifically as:

/jcr:root/people/element(*,my:employee)/my:name[@first='John']

assuming a simple CND such as:

[my:name] > nt:base
- first: string
- last: string

[my:dob] > nt:base
- value: string

[my:employee] > nt:base
+ my:name = my:name
+ my:dob = my:dob

Incidentally, custom node types are quite essential when we can have several different types of nodes under 'people', for instance *my:employee* and *my:freelancer*. If I didn't have node types and wanted to find all the Smiths that are not freelancers, then, not having access to the parent axis (it's not required in JCR), I would have to run

/jcr:root/people/*/my:name[@last='Smith']

and then, in Java, find out which results do not have a freelancer parent. Besides being tedious, it could be very inefficient. In traditional XML modeling and querying, and unless we wanted to take advantage of inheritance, this would not be needed because the XPATH itself would distinguish between the cases:

/jcr:root/people/my:employee/my:name[@last='Smith']

Of course, in both cases (JCR and XML) I could structure my data better and separate employees from freelancers under different nodes (*my:employees* and *my:freelancers*), and I would not have this problem. At the same time, when you can have multiple criteria, orthogonal or not, it becomes quite complex to choose which one is best to "hardwire" into the structure (what about male/female, working/retired, etc.). The choice of what drives the hierarchy and what is instead an attribute (or a property) is sometimes not obvious, and often turns out to be the wrong one (when it's too late, typically...).

Viewing JCR structures as XML is not a side effect; it's part of the JCR spec, which says that an XPATH query is run against the virtual XML document (6.6.4.10 and others).
And an XML Document View is the *normal* way to look at the data as XML. (Of course, System View is the one to be used for round tripping, I know...)

At the same time, as we have seen, this special relationship that JCR has with XML should not be used to inspire the model, because SNS are complex to handle; nodes should therefore be named with a parent-unique "id" and not with their "type", and the element(*,my:type) function should be used wherever I really intend to select by the type of the node. Because of this, it is not unusual to have to write queries such as

//element(*,my:type)/element(*,my:other-type)[element(*,my:last-type)]

instead of

//my:type/my:other-type[my:last-type]

and this assumes that every node is strictly typed, which is not always desirable or possible.

As another use case, in my application (yes, who cares?), XSLT stylesheets can access the repository through a (RESTful) type of query that is expressed in JCR/XPATH, and they can work with the resulting document using XML/XPATH. For instance, suppose the URI of my node's XML representation is

http://localhost/jcr/default/blogs/2008/myfirstpost/blog

and the resulting document is:

<blog>
  <headline>test</headline>
  <body>
    <p>first paragraph</p>
    <p>second paragraph</p>
  </body>
</blog>

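The two paragraphs in that document are same name siblings, and a positional step is what tells them apart. A minimal Python sketch follows, assuming the markup was a blog element with a headline and body/p children (the element names are taken from the URIs used in this message; the exact original markup is an assumption).

```python
import xml.etree.ElementTree as ET

# Assumed shape of the blog document; only the text content is certain.
BLOG = """
<blog>
  <headline>test</headline>
  <body>
    <p>first paragraph</p>
    <p>second paragraph</p>
  </body>
</blog>
"""

blog = ET.fromstring(BLOG)

# A uniquely named child needs no index.
headline = blog.find("headline").text

# Same name siblings are addressed by position (1-based, as in XPATH).
first_paragraph = blog.find("body/p[1]").text
second_paragraph = blog.find("body/p[2]").text

print(headline, "|", first_paragraph, "|", second_paragraph)
```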
The nice thing is that it's possible to use, for instance,

http://localhost/jcr/default/blogs/2008/myfirstpost/blog/headline

to get only the headline, or even

http://localhost/jcr/default/blogs/2008/*/blog/headline

to get all the headlines in 2008. What I could not do, if SNS were not there, is use

http://localhost/jcr/default/blogs/2008/myfirstpost/blog/body/p[1]

to get the first paragraph of my blog post, or

http://localhost/jcr/default/blogs/2008/*/blog/body/p[1]

to get the first paragraph of every post in 2008.

So, even when nodes have unique names ('myfirstpost'), at a certain level below them same name siblings in the form of tags are likely to appear, and that's a nice thing, because it allows a seamless transition from the URI of a node representation as it is seen on the server to the URI of the element that is being processed. In other words, the URI space is continuous.

Still, the dilemma remains: why is it best practice to name nodes after their contents in JCR modeling, and after their types in XML?

What I wonder is whether it would be a good idea to *introduce another type of Document View* (let's call it Normal View for now), where *node types are element names*, *properties are still attributes*, and *a jcr:name pseudo-attribute is added* (instead of jcr:primaryType) *to represent the node name*. In this case, I could write my query with 'old style' XPATH 1.0 (minus of course the order by), XML could still be used to inspire the model, and SNS would be avoided.
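A rough sketch of what such a Normal View could look like, in Python, with ElementTree standing in for the repository. Everything here is an assumption made for illustration: the tuple layout for nodes, the to_normal_view helper, the my: namespace URI, and the freelancer and ann.jones entries. Nothing in it is Jackrabbit API.

```python
import xml.etree.ElementTree as ET

NS = {
    "jcr": "http://www.jcp.org/jcr/1.0",
    "nt": "http://www.jcp.org/jcr/nt/1.0",
    "my": "http://example.com/my",  # hypothetical URI for the my: prefix
}


def clark(qname):
    """Turn 'prefix:local' into ElementTree's '{uri}local'; leave plain names alone."""
    prefix, _, local = qname.partition(":")
    return "{%s}%s" % (NS[prefix], local) if local else qname


def to_normal_view(node):
    """node is a (name, primary_type, properties, children) tuple."""
    name, primary_type, properties, children = node
    element = ET.Element(clark(primary_type))   # node type becomes the element name
    element.set(clark("jcr:name"), name)        # node name becomes a pseudo-attribute
    for key, value in properties.items():       # properties stay attributes
        element.set(clark(key), value)
    for child in children:
        element.append(to_normal_view(child))
    return element


def person(node_name, node_type, first, last, dob):
    return (node_name, node_type, {}, [
        ("my:name", "my:name", {"first": first, "last": last}, []),
        ("my:dob", "my:dob", {"value": dob}, []),
    ])


people = ("people", "nt:unstructured", {}, [
    person("john.smith", "my:employee", "John", "Smith", "10/01/1970"),
    person("mary.smith", "my:freelancer", "Mary", "Smith", "11/07/1973"),
    person("ann.jones", "my:employee", "Ann", "Jones", "02/03/1980"),
])

root = to_normal_view(people)

# Plain XPATH 1.0 style steps now select by type and position.
second_employee = root.find("my:employee[2]/my:name", NS)
john_dob = root.find("my:employee[@jcr:name='john.smith']/my:dob", NS)

print(second_employee.get("first"), john_dob.get("value"))
```

In the repository itself the node names stay unique (john.smith, mary.smith, ann.jones); the repeated my:employee elements exist only in the generated view.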
And, I believe, queries would be both simpler and more meaningful to XML developers, to the point that it would be easier to migrate an XML-centric application to the JCR model (with some caveats, of course).

With this feature, the JCR structure above could be queried with XPATH against its virtual Normal View (in addition to the Document View). So, for instance, I could write

//people//my:employee[2]/my:name

as an XPATH expression against the Normal View to find my second employee, or

//people//my:employee[@id='john.smith']/my:dob

to find when the employee (not the freelancer) with id john.smith was born.

And what if no node types are defined? Then the regular Document View based JCR/XPATH would probably be better suited, as the intent of the alternative Normal View is to express queries using XML-style XPATH for nodes that are typed, and to disambiguate the way XML documents are seen once imported into the JCR.

So what about importing and exporting this view? In the JCR paradigm, or at least in Jackrabbit, importing XML (that is not generated by a System View export) means mapping each element to a node and each attribute to a property. If the element does not have a jcr:primaryType attribute, then the node is created as nt:unstructured, the attributes become string properties, and XML text nodes are created as jcr:xmltext children with a single string property (jcr:xmlcharacters). If instead the jcr:primaryType attribute is present, then Jackrabbit tries to map the XML to the corresponding node type, throwing an exception if it can't (for instance because of a conflicting structure).

So, in addition to this behavior during import, another one could be introduced:

*During import, each element that has a jcr:name attribute would be created as a node with a name equal to the value of jcr:name, and with a node type equal to the element's name.
If the node type is not present, it could either be created on the fly (inheriting from nt:unstructured), or an exception could be thrown. Similarly, if an element does not have a jcr:name, or has a jcr:name identical to a sibling's, an exception could be thrown, or a new id could be assigned silently.*

For export, a new method exportNormalView() could be added next to the existing exportDocumentView() and exportSystemView(); it would export a materialized view of the virtual Normal View.

In this way, importing XML into the repository would not create SNS, the element() function would be needed only when the type inheritance hierarchy has to be evaluated, and, most important, people would no longer be confused by modeling "the XML way" vs "the JCR way".

Finally, the technical question. Is there a simple way to extend the XPATH parser to handle this type of query? Or, has anybody had any experience plugging Jaxen into the JCR? Everything else seems pretty straightforward to implement, even if just to see how it behaves in the real world.

Of course, any thought, even an utterly critical one, is welcome.

Alessandro