Mailing-List: contact users-help@jackrabbit.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@jackrabbit.apache.org
Message-ID: <29a095670803311346s7fae5c9end7364694f9da8e98@mail.gmail.com>
Date: Mon, 31 Mar 2008 16:46:44 -0400
From: "Alessandro Bologna" <alessandro.bologna@gmail.com>
To: users@jackrabbit.apache.org
Subject: XML, SNS, and JCR
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_16906_20676055.1206996404863"

------=_Part_16906_20676055.1206996404863
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Hi all,

One of the most fascinating things about JCR is that it always gives you
more to think about.

What follows is a very long message that tries to make the point that
maybe we need another way to map XML to JCR and vice versa. Besides being
long, it is probably boring, and probably naive in some parts, so read it
only if the topic matters to you...

So, the story goes that after Jukka asked whether it would be worth
dropping support for Same Name Siblings (SNS), and knowing well how useful
SNS are in mapping XML documents to JCR, I wondered if there was something
missing in the puzzle: XML has no issue with SNS, and XPath (1.0 and 2.0)
is quite happy with them too. At the same time, David's modeling
suggestion "Beware of Same Name Siblings" seems to contradict the usage
experience of those who come from an XML background.

In other words, in XML it is pretty normal to have:

<people>
<my:employee>
  <my:name first="John" last="Smith"/>
  <my:dob value="10/01/1970"/>
</my:employee>
<my:employee>
  <my:name first="Mary" last="Smith"/>
  <my:dob value="11/07/1973"/>
</my:employee>
</people>

while something like this would be unusual:

<people>
<john.smith>
  <my:name first="John" last="Smith"/>
  <my:dob value="10/01/1970"/>
</john.smith>
<mary.smith>
  <my:name first="Mary" last="Smith"/>
  <my:dob value="11/07/1973"/>
</mary.smith>
</people>

It's possible of course, just not the usual way people design XML.
By the way, in the examples above I am using an attribute-centric model just
for simplicity of comparison with the JCR properties.

The same considerations would apply if I were to use child elements (and
jcr:xmltext child nodes with a property jcr:xmlcharacters), but what
matters is that, in XML, *the element name is almost always mapped to the
type, not the instance*.

In JCR modeling, this can lead to all the well-known issues with same name
siblings, so the approach is instead more "file" centric, where each
element (node) is given a unique identifier, *unless it's not needed*. For
example, this should be OK in JCR:

people
|
+---john.smith
|     +---- my:name:
|     |       +--- -first: John
|     |       +--- -last: Smith
|     +---- my:dob:
|             +--- -value: 10/01/1970
+---mary.smith
      +---- my:name:
      |       +--- -first: Mary
      |       +--- -last: Smith
      +---- my:dob:
              +--- -value: 11/07/1973


In order to avoid SNS, the idea is to use a parent-unique id as the node
name wherever a conflict would arise. This is not required for nodes that
are already logically unique in their parent's context (for instance,
my:name and my:dob). In this model, the node name can always be made
unique when needed by adding SSN, DOB, or something else.

This means that the XML/XPath query

/people/my:employee/my:name[@last='Smith']

would need to be rewritten in JCR/XPath as:

/jcr:root/people/*/my:name[@last='Smith']

because of course the node name is not known a priori, or (better) as

/jcr:root/people/element(*,nt:base)/my:name[@last='Smith']

The second notation uses the XPath 2.0 element() test, which selects
nodes of a specific type (or of a type derived from that type). In XML, it
matches against the schema type; in JCR, against the node type.
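
To make the difference concrete, here is a minimal Python sketch that runs
both styles of query against plain XML with the standard library's
ElementTree (not against a JCR repository). The namespace URI bound to the
`my` prefix is a placeholder of mine, and ElementTree only supports a small
XPath subset, so the /jcr:root step and the element() test are left out and
the paths are relative to the `people` root.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI for the "my" prefix (an assumption, not from the post).
NS = {"my": "http://example.com/my"}

# XML style: siblings share the element name (same name siblings).
SNS_XML = """
<people xmlns:my="http://example.com/my">
  <my:employee>
    <my:name first="John" last="Smith"/>
  </my:employee>
  <my:employee>
    <my:name first="Mary" last="Smith"/>
  </my:employee>
</people>
"""

# JCR style: every sibling gets a parent-unique name.
UNIQUE_XML = """
<people xmlns:my="http://example.com/my">
  <john.smith>
    <my:name first="John" last="Smith"/>
  </john.smith>
  <mary.smith>
    <my:name first="Mary" last="Smith"/>
  </mary.smith>
</people>
"""

def first_names(xml_text, path):
    """Run a path relative to the root and collect the 'first' attributes."""
    root = ET.fromstring(xml_text)
    return [elem.get("first") for elem in root.findall(path, NS)]

# Selecting by element name works when the name encodes the type...
by_type = first_names(SNS_XML, "my:employee/my:name[@last='Smith']")
# ...while unique names force a wildcard step, which no longer says "employee".
by_wildcard = first_names(UNIQUE_XML, "*/my:name[@last='Smith']")
print(by_type, by_wildcard)
```

Both queries return the same people, but only the first one states in the
path itself that the matches are employees.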

If we were using custom node types, and let's assume that we do from now on,
then the JCR query above could have been written more specifically as:

/jcr:root/people/element(*,my:employee)/my:name[@first='John']

assuming a simple CND such as:

[my:name] > nt:base
  - first (string)
  - last (string)

[my:dob] > nt:base
  - value (string)

[my:employee] > nt:base
  + my:name (my:name)
  + my:dob (my:dob)

Incidentally, custom node types are quite essential when we could have
several different types of nodes under 'people', for instance *my:employee*
and *my:freelancer*.

If I didn't have node types and wanted to find all the Smiths that are not
freelancers, then, without access to the parent axis (it's not required in
JCR), I would have to do:

/jcr:root/people/*/my:name[@last='Smith']

and then, in Java, find out which results do not have a freelancer parent.
Besides being tedious, it could be very inefficient.
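
A rough sketch of that two-step approach, written in Python against a
stand-in XML document rather than in Java against the JCR API. The `kind`
attribute marking freelancers and the namespace URI are hypothetical; the
post does not say how a freelancer would be recognized without node types.

```python
import xml.etree.ElementTree as ET

NS = {"my": "http://example.com/my"}  # placeholder URI (assumption)

# Stand-in for the repository content. The "kind" attribute is a hypothetical
# marker for freelancers, since no node types are available in this scenario.
DOC = """
<people xmlns:my="http://example.com/my">
  <john.smith>
    <my:name first="John" last="Smith"/>
  </john.smith>
  <mary.smith kind="freelancer">
    <my:name first="Mary" last="Smith"/>
  </mary.smith>
  <ann.jones>
    <my:name first="Ann" last="Jones"/>
  </ann.jones>
</people>
"""

root = ET.fromstring(DOC)

# Step 1: the query can only filter on the property, not on the parent.
matches = root.findall("*/my:name[@last='Smith']", NS)

# Step 2: a client-side pass over every result to inspect its parent.
# ElementTree keeps no parent pointers, so build a child -> parent map first.
parent_of = {child: parent for parent in root.iter() for child in parent}
non_freelancers = [
    name.get("first")
    for name in matches
    if parent_of[name].get("kind") != "freelancer"
]
print(non_freelancers)
```

The second step touches every result of the first one, which is where the
inefficiency comes from when most of the matches end up being discarded.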

In traditional XML modelling and querying, and unless we wanted to take
advantage of inheritance, this would not be needed because the XPath itself
would allow distinguishing between the cases:

/jcr:root/people/my:employee/my:name[@last='Smith']

Of course, in both cases (JCR and XML) I could structure my data better and
separate employees from freelancers under different nodes (*my:employees*
and *my:freelancers*), and I would not have this problem. At the same time,
when you can have multiple criteria, orthogonal or not, it becomes quite
complex to choose which one is best "hardwired" into the structure (what
about male/female, working/retired, etc.).

The choice of what drives the hierarchy and what is instead an attribute
(or a property) is sometimes not obvious, and often turns out to be the
wrong one (when it's too late, typically...).

The choice of viewing JCR structures as XML is not a side effect; it's part
of the JCR spec, which says that an XPath query is run against the virtual
XML document (6.6.4.10 and others). And an XML Document View is the
*normal* way to look at the data as XML. (Of course, System View is the one
to be used for round tripping, I know...).

At the same time, as we have seen, this special relationship that JCR has
with XML should not be used to inspire the model, because SNS are complex
to handle. Nodes should therefore be named with a parent-unique "id" and
not with their "type", and the element(*,my:type) test should be used
wherever I really intend to select by the type of the node.

Because of this, it is not unusual to have to write queries such as

//element(*,my:type)/element(*,my:other-type)[element(*,my:last-type)]

instead of

//my:type/my:other-type[my:last-type]

and this assuming that every node is strictly typed, which is not always
desirable or possible.

As another use case, in my application (yes, who cares?), XSLT stylesheets
can access the repository by using a (RESTful) type of query that is
expressed in JCR/XPath, and they can work with the resulting document using
XML/XPath. Suppose, for instance, that the URI of my node's XML
representation is

http://localhost/jcr/default/blogs/2008/myfirstpost/blog

and the resulting document is:

<blog>
  <headline>test</headline>
  <body>
   <p>first paragraph</p>
   <p>second paragraph</p>
  </body>
</blog>

The nice thing is that it is then possible to use, for instance,

http://localhost/jcr/default/blogs/2008/myfirstpost/blog/headline

to get only the headline, or even:

http://localhost/jcr/default/blogs/2008/*/blog/headline

to get all the headlines in 2008.

What I could not do, if SNS were not there, is:

http://localhost/jcr/default/blogs/2008/myfirstpost/blog/body/p[1]

to get the first paragraph of my blog post, or

http://localhost/jcr/default/blogs/2008/*/blog/body/p[1]

to get the first paragraphs of all posts in 2008. So, even when nodes have
unique names ('myfirstpost'), at a certain level below them same name
siblings in the form of tags are likely to appear. That is a nice thing,
because it allows a seamless transition from the URI of a node
representation as it is seen on the server to the URI of the element that
is being processed. In other words, the URI space is continuous.

Still, the dilemma remains: why is it best practice in JCR modeling to name
nodes after their content, and in XML after their type?

What I wonder is whether it would be a good idea to *introduce another type
of Document View* (let's call it Normal View for now), where *node types
are element names*, *properties are still attributes*, and *a jcr:name
pseudo-attribute is added* (instead of jcr:primaryType) to represent the
node name.

In this case, I could write my queries in 'old style' XPath 1.0 (minus, of
course, the order by), XML could still be used to inspire the model, and
SNS would be avoided. And, I believe, queries would be both simpler and
would make more sense to XML developers, to the point that it would be
easier to migrate an XML-centric application to the JCR model (with some
caveats, of course).

With this feature, the JCR structure above could be queried with XPath
against its virtual Normal View (in addition to the Document View):

<people>
  <my:employee jcr:name="john.smith">
    <my:name first="John" last="Smith"/>
    <my:dob value="10/01/1970"/>
  </my:employee>
  <my:employee jcr:name="mary.smith">
    <my:name first="Mary" last="Smith"/>
    <my:dob value="11/07/1973"/>
  </my:employee>
</people>

so, for instance, I could write:

//people//my:employee[2]/my:name
    as an XPath expression on the Normal View to find my second employee,

//people//my:employee[@jcr:name='john.smith']/my:dob
    to find when the employee (not the freelancer) named john.smith was
    born.
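
To get a feel for how such a view could be materialized and queried, here
is a small Python sketch. It is only an illustration under my own
assumptions: `Node` is a toy stand-in for a JCR node, the `my` namespace
URI is a placeholder, the root is given a type called `people` for
simplicity, and jcr:name is emitted on every element rather than only on
the employees. The queries use ElementTree's XPath subset, relative to the
root.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

MY = "http://example.com/my"        # placeholder URI (assumption)
JCR = "http://www.jcp.org/jcr/1.0"  # namespace URI used by JCR for "jcr"
NS = {"my": MY, "jcr": JCR}

@dataclass
class Node:
    """Toy stand-in for a JCR node: a name, a node type, string properties."""
    name: str
    node_type: str  # prefixed name, e.g. "my:employee"
    props: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def qname(prefixed):
    """Turn 'my:employee' into ElementTree's '{uri}employee' form."""
    if ":" not in prefixed:
        return prefixed
    prefix, local = prefixed.split(":", 1)
    return "{%s}%s" % (NS[prefix], local)

def to_normal_view(node):
    """Element name = node type, attributes = properties, jcr:name = node name."""
    elem = ET.Element(qname(node.node_type), dict(node.props))
    elem.set(qname("jcr:name"), node.name)
    for child in node.children:
        elem.append(to_normal_view(child))
    return elem

def person(name, first, last, dob):
    """Build one employee node with parent-unique name `name`."""
    return Node(name, "my:employee", children=[
        Node("my:name", "my:name", {"first": first, "last": last}),
        Node("my:dob", "my:dob", {"value": dob}),
    ])

people = Node("people", "people", children=[
    person("john.smith", "John", "Smith", "10/01/1970"),
    person("mary.smith", "Mary", "Smith", "11/07/1973"),
])

view = to_normal_view(people)
# Positional access works because siblings share the type as element name.
second = view.find("my:employee[2]/my:name", NS)
# The unique node name is still reachable through the jcr:name attribute.
dob = view.find("my:employee[@jcr:name='john.smith']/my:dob", NS)
print(second.get("first"), dob.get("value"))
```

The underlying nodes keep their unique names, so no SNS are stored; the
repeated element names exist only in the generated view.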

And what if no node types are defined? Then the regular Document View based
JCR/XPath would probably be better suited, as the intent of the alternative
Normal View is to express queries using XML-style XPath for nodes that are
typed, and to disambiguate the way that XML documents are seen once
imported into JCR.

So what about importing and exporting this view?

In the JCR paradigm, or at least in Jackrabbit, importing XML (that is not
generated by a System View export) means mapping each element to a node and
each attribute to a property. If the element does not have a
jcr:primaryType attribute, then the element is created as nt:unstructured,
the attributes as strings, and XML text nodes as jcr:xmltext children with
a single property of type string (jcr:xmlcharacters). If instead the
jcr:primaryType attribute is present, then Jackrabbit tries to map the XML
to the corresponding node type, throwing an exception if it can't (for
instance because of a conflicting structure).

So, in addition to this behavior during import, another one could be
introduced:

During import, each element that has a jcr:name attribute would be created
as a node with a name equal to the value of jcr:name, and with a node type
equal to the element's name. If the node type is not present, it could
either be created on the fly (as inherited from nt:unstructured), or an
exception could be thrown. Similarly, if an element does not have a
jcr:name, or has a jcr:name identical to a sibling's, an exception could be
thrown, or a new id could be assigned silently.

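The import rule above could be sketched roughly as follows, in Python and
on plain dictionaries rather than on a real repository. Everything named
here is hypothetical: `import_normal_view`, `NormalViewError`, the
`node-N` scheme for silently assigned ids, and the placeholder `my`
namespace URI. Unknown node types always raise in this sketch; creating
them on the fly is not shown.

```python
import xml.etree.ElementTree as ET

JCR_NAME = "{http://www.jcp.org/jcr/1.0}name"  # the jcr:name pseudo-attribute
MY = "http://example.com/my"                   # placeholder URI (assumption)

class NormalViewError(ValueError):
    """Raised when a Normal View document cannot be imported (hypothetical)."""

def import_normal_view(elem, known_types, strict=True):
    """Map an element to a node dict: element name -> type, jcr:name -> name."""
    if elem.tag not in known_types:
        raise NormalViewError("unknown node type: " + elem.tag)
    node = {
        "type": elem.tag,
        "props": {k: v for k, v in elem.attrib.items() if k != JCR_NAME},
        "children": {},
    }
    for child in elem:
        name = child.get(JCR_NAME)
        if name is None or name in node["children"]:
            if strict:
                raise NormalViewError(
                    "missing or duplicate jcr:name under " + elem.tag)
            # Lenient mode: assign a fresh id silently (hypothetical scheme).
            index = len(node["children"]) + 1
            while "node-%d" % index in node["children"]:
                index += 1
            name = "node-%d" % index
        node["children"][name] = import_normal_view(child, known_types, strict)
    return node

DOC = """
<people xmlns:my="http://example.com/my"
        xmlns:jcr="http://www.jcp.org/jcr/1.0">
  <my:employee jcr:name="john.smith">
    <my:dob jcr:name="dob" value="10/01/1970"/>
  </my:employee>
  <my:employee jcr:name="mary.smith">
    <my:dob jcr:name="dob" value="11/07/1973"/>
  </my:employee>
</people>
"""

TYPES = {"people", "{%s}employee" % MY, "{%s}dob" % MY}
tree = import_normal_view(ET.fromstring(DOC), TYPES)
print(sorted(tree["children"]))
```

Children are stored in a dictionary keyed by node name, so two siblings can
never end up with the same name, whichever of the two policies is chosen.
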
For export, a new method exportNormalView() could be added to the already
present exportDocumentView() and exportSystemView() and would export a
materialized view of the virtual Normal View.

In this way, importing XML in the repository would not create SNS, the
element() function would be needed only when the type inheritance hierarchy
needs to be evaluated, and, most important, people would not be confused
anymore with modeling "the XML way" vs "the JCR way".

Finally, the technical question. Is there a simple way to extend the XPath
parser to handle this type of query? Or has anybody had any experience
plugging Jaxen into JCR? Everything else seems pretty straightforward to
implement, even if just to see how it behaves in the real world.

Of course, any thought, even an utterly critical thought, is welcome.
Alessandro

------=_Part_16906_20676055.1206996404863--