jackrabbit-users mailing list archives

From "David Nuescheler" <da...@day.com>
Subject Re: XML, SNS, and JCR
Date Mon, 31 Mar 2008 22:06:06 GMT
Hi Alessandro,

thanks for the very thoughtful and inspiring post.

I think you bring up many interesting points; it may even be worth
splitting them into separate conversations.

First of all, congratulations on the RESTful URLs that you mention in
your post, which is something that I do not see very often. You
may find that the URL mapping in Apache Sling [1] is very similar.

I have to admit that when I crafted the "Beware of SNS" [2] rule I
was thinking of people modelling content in the repository who come from
a database background, and I will probably have to look at things
again from an XML perspective.

I think the approach to the "Normal View" is very intriguing, and after
thinking it through briefly I think it would not be too hard to implement,
either for import/export or for XPath query, and yet, as you point
out, it would avoid the XML vs. JCR data model dilemma.

As you mention, the DocView is for round tripping arbitrary XML while
the SysView is for round tripping arbitrary content. If I am not mistaken
the "Normal View" would not allow either of the two, but would add a lot
of value as an efficient way to deal with something that I would call
"JCR aware XML". I think it would be interesting to find out what the
characteristics and limitations of such a view are, both from an XML (import)
and from a JCR (export/query) perspective. I assume we would end up
with the same limitations as the DocView from a JCR perspective
and possibly with the limitation that the XML elements would have to
match pre-registered (possibly auto-defined & registered) node types.

Is that correct?

Very interesting idea.
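
To illustrate why plain XPath 1.0 might be enough against such a view, here
is a minimal sketch that uses only the JDK's javax.xml.xpath, not the
Jackrabbit query engine. The class name NormalViewXPath, the "my" namespace
URI and the hand-written Normal View document are assumptions made for
illustration; the jcr namespace URI is the one defined by JSR 170.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class NormalViewXPath {
    // The "my" URI is a placeholder (assumption); the "jcr" URI is the JSR 170 one.
    static final Map<String, String> NAMESPACES = Map.of(
            "my", "http://example.com/my",
            "jcr", "http://www.jcp.org/jcr/1.0");

    // Hand-written stand-in for what a Normal View export might look like.
    static final String NORMAL_VIEW =
            "<people xmlns:my='http://example.com/my'"
          + " xmlns:jcr='http://www.jcp.org/jcr/1.0'>"
          + "<my:employee jcr:name='john.smith'>"
          + "<my:name first='John' last='Smith'/>"
          + "<my:dob value='10/01/1970'/>"
          + "</my:employee>"
          + "<my:employee jcr:name='mary.smith'>"
          + "<my:name first='Mary' last='Smith'/>"
          + "<my:dob value='11/07/1973'/>"
          + "</my:employee>"
          + "</people>";

    /** Evaluates an XPath 1.0 expression and returns its string value. */
    static String evaluate(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the my: and jcr: prefixes
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(NORMAL_VIEW)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Map the prefixes used in the expression to namespace URIs.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return NAMESPACES.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
                }
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("query failed: " + expression, e);
        }
    }
}
```

With this, NormalViewXPath.evaluate("//people//my:employee[2]/my:name/@first")
selects Mary's first name, and the jcr:name pseudo-attribute can be used in a
predicate such as //people//my:employee[@jcr:name='john.smith']/my:dob/@value.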


[1] http://incubator.apache.org/sling/site/index.html
[2] http://wiki.apache.org/jackrabbit/DavidsModel#head-1df0224190c265f5156f037eb3f20e314fa6c4a7

On Mon, Mar 31, 2008 at 10:46 PM, Alessandro Bologna
<alessandro.bologna@gmail.com> wrote:
> Hi all,
>  One of the most fascinating things about the JCR is that it always gives more
>  to think about.
>  What follows is a very long message that is trying to make the point that
>  maybe we need another way to map XML to JCR and vice versa. Besides being
>  long, it is probably even boring, and probably even naive in some parts, so
>  read it only if the topic matters to you...
>  So, the story goes that after Jukka asked whether it would be worth
>  dropping support for Same Name Siblings, and knowing well how useful SNS are
>  in mapping XML documents in the JCR, I wondered if there was something that
>  was missing in the puzzle: XML has no issue with SNS, and XPATH (1.0 and 2.0)
>  is quite happy with them too. At the same time, David's
>  modeling suggestion "Beware of Same Name Siblings" seems to contradict the
>  usage experience of those who come from an XML background.
>  In other words, in XML it is pretty normal to have:
>  <people>
>  <my:employee>
>   <my:name first="John" last="Smith"/>
>   <my:dob value="10/01/1970"/>
>  </my:employee>
>  <my:employee>
>   <my:name first="Mary" last="Smith"/>
>   <my:dob value="11/07/1973"/>
>  </my:employee>
>  </people>
>  while something like this would be unusual:
>  <people>
>  <john.smith>
>   <my:name first="John" last="Smith"/>
>   <my:dob value="10/01/1970"/>
>  </john.smith>
>  <mary.smith>
>   <my:name first="Mary" last="Smith"/>
>   <my:dob value="11/07/1973"/>
>  </mary.smith>
>  </people>
>  It's possible of course, just not the usual way people design XML.
>  By the way, in the examples above I am using an attribute-centric model just
>  for simplicity of comparison with the JCR properties.
>  The same considerations would apply if I were to use child elements (and
>  jcr:xmltext child nodes with a property jcr:xmlcharacters), but what matters
>  is that, in XML, *the element name is almost always mapped to the type, not
>  the instance*.
>  In JCR modeling, this can lead to all the well known issues with same name
>  siblings, so the approach is instead more "file" centric, where each element
>  (node) is given a unique identifier, *unless it's not needed*: for example
>  this should be OK in JCR:
>  people
>  |
>  +---john.smith
>  |     +---- my:name:
>  |     |       +--- -first: John
>  |     |       +--- -last: Smith
>  |     +---- my:dob:
>  |             +--- -value: 10/01/1970
>  +---mary.smith
>       +---- my:name:
>       |       +--- -first: Mary
>       |       +--- -last: Smith
>       +---- my:dob:
>               +--- -value: 11/07/1973
>  In order to avoid SNS, the idea is to use a parent-unique id as the node name
>  where a conflict would arise, but it is not required for nodes that are
>  logically already unique in their parent's context (for instance, my:name
>  and my:dob). In this model, the node name can always be made unique when
>  needed, by adding SSN, DOB, or something else.
>  This means that the XML/XPATH query
>  /people/my:employee/my:name[@last='Smith']
>  would need to be rewritten in JCR/XPATH as:
>  /jcr:root/people/*/my:name[@last='Smith']
>  because of course the node name is not known a priori, or (better) as
>  /jcr:root/people/element(*,nt:base)/my:name[@last='Smith']
>  The second notation uses the XPATH 2.0 element() function, which selects
>  nodes of a specific type (or of a type that inherits from that
>  type). In XML, it uses the schema element name; in JCR, the node type.
>  If we were using custom node types, and let's assume that we do from now on,
>  then the JCR query above could have been written more specifically as:
>  /jcr:root/people/element(*,my:employee)/my:name[@first='John']
>  assuming a simple CND such as:
>  [my:name] > nt:base
>   - first (string)
>   - last (string)
>  [my:dob] > nt:base
>   - value (string)
>  [my:employee] > nt:base
>   + my:name (my:name)
>   + my:dob (my:dob)
>  Incidentally, custom node types are quite essential when we could have
>  several different types of nodes under 'people', for instance *my:employee* and
>  *my:freelancer*.
>  If I didn't have node types, and I wanted to find all the Smiths that are
>  not freelancers, then, not having access to the parent axis (it's not required
>  in JCR), I would have to do:
>  /jcr:root/people/*/my:name[@last='Smith']
>  and then, in Java, find out which ones do not have a freelancer parent. Besides
>  being tedious, it could be very inefficient.
>  In traditional XML modelling and querying, and unless we wanted to take
>  advantage of inheritance, this would not be needed because the XPATH itself
>  would allow distinguishing between the cases:
>  /jcr:root/people/my:employee/my:name[@last='Smith']
>  Of course, in both cases (JCR and XML) I could structure my data better and
>  separate employees from freelancers under different nodes (*my:employees* and
>  *my:freelancers*), and I would not have this problem; at the same time, when
>  you can have multiple criteria, orthogonal or not, it becomes quite complex
>  to choose which one is best to "hardwire" in the structure (what
>  about male/female, working/retired, etc).
>  The choice of what drives the hierarchy and what is instead an attribute
>  (or a property) is sometimes not obvious, and often turns out not to be the
>  right one (when it's too late, typically...).
>  The choice of viewing JCR structures as XML is not a side effect, it's part
>  of the JCR specs, where it says that an XPATH query is run against the
>  virtual XML document ( and others). And an XML Document View is the
>  *normal* way to look at the data as XML. (Of course, System View is the one
>  to be used for round tripping, I know...).
>  At the same time, as we see, this special relationship that JCR has with XML
>  should not be used to inspire the model, because SNS are complex to handle, and
>  therefore nodes should have a parent-unique "id" as their name and not their
>  "type", and the element(*,my:type) function should be used wherever I really
>  intend to select by the type of the node.
>  Because of this, it is not unusual to have to write queries such as
>  //element(*,my:type)/element(*,my:other-type)[element(*,my:last-type)]
>  instead of
>  //my:type/my:other-type[my:last-type]
>  and this assuming that every node is strictly typed, which is not always
>  desirable or possible.
>  As another use case, in my application (yes, who cares?), XSLT stylesheets
>  can access the repository by using a (RESTful) type of query that is
>  expressed in JCR/XPATH, and they can work with the resulting document using
>  XML/XPATH. Suppose, for instance, that my node's XML representation
>  URI is
>  http://localhost/jcr/default/blogs/2008/myfirstpost/blog
>  and the resulting document is:
>  <blog>
>   <headline>test</headline>
>   <body>
>    <p>first paragraph</p>
>    <p>second paragraph</p>
>   </body>
>  </blog>
>  The nice thing is that it's possible to use, for instance,
>  http://localhost/jcr/default/blogs/2008/myfirstpost/blog/headline
>  to get only the headline, or even:
>  http://localhost/jcr/default/blogs/2008/*/blog/headline
>  to get all the headlines in 2008.
>  What I could not do, if SNS were not there, is:
>  http://localhost/jcr/default/blogs/2008/myfirstpost/blog/body/p[1]
>  to get the first paragraph of my blog, or
>  http://localhost/jcr/default/blogs/2008/*/blog/body/p[1]
>  to get the first paragraphs of all posts in 2008. So, even when nodes have
>  unique names ('myfirstpost'), at a certain level below, same name siblings in
>  the form of tags are likely to appear, and that's a nice thing, because it allows
>  a seamless transition from the URI of a node representation as it is seen on
>  the server to the URI of the element that is being processed. In other
>  words, the URI space is continuous.
>  Still, the dilemma remains: why is it best practice in JCR modeling to name
>  nodes after their contents, and in XML after their types?
>  What I wonder is whether it would be a good idea to *introduce another type
>  of Document View* (let's call it Normal View for now), where *node types
>  are element names*, *properties are still attributes*, *and a jcr:name
>  pseudo-attribute is added (instead of jcr:primaryType) to represent the
>  node name*.
>  In this case, I could write my query with 'old style' XPATH 1.0 (minus of
>  course the order by), XML could still be used to inspire the model and SNS
>  would be avoided. And, I believe, queries would be both simpler and would
>  make more sense to XML developers, to the point that it would be easier to
>  migrate an XML centric application to the JCR model (with some caveats, of
>  course).
>  With this feature, the JCR structure above could be queried with XPATH
>  against its virtual Normal View (in addition to the Document View):
>  <people>
>   <my:employee jcr:name="john.smith">
>     <my:name first="John" last="Smith"/>
>     <my:dob value="10/01/1970"/>
>   </my:employee>
>   <my:employee jcr:name="mary.smith">
>     <my:name first="Mary" last="Smith"/>
>     <my:dob value="11/07/1973"/>
>   </my:employee>
>  </people>
>  so, for instance, I could write:
>  //people//my:employee[2]/my:name
>  as an XPATH expression for the Normal View to find my second employee, or
>  //people//my:employee[@jcr:name='john.smith']/my:dob
>  to find when the employee (not the freelancer) named john.smith was born.
>  And what if no node types are defined? Then the regular Document View based
>  JCR/XPATH would probably be better suited, as the intent of the alternative
>  Normal View is to express queries using XML style XPATH for nodes that are
>  typed, and to disambiguate the way that XML documents are seen once imported
>  into the JCR.
>  So what about importing and exporting this view?
>  In the JCR paradigm, or at least in Jackrabbit, importing XML (that is not
>  generated by a System View export) means mapping each element to a node and
>  each attribute to a property. If the element does not have a jcr:primaryType
>  attribute, then the node is created as nt:unstructured, the attributes as
>  string properties, and XML text nodes as jcr:xmltext children with a single
>  property of type string (jcr:xmlcharacters). If instead the jcr:primaryType
>  attribute is present, then Jackrabbit tries to map the XML to the
>  corresponding node type, throwing an exception if it can't (for instance
>  because of a conflicting structure).
>  So, in addition to this behavior during import, another one could be
>  introduced:
>  *During import, each element that has a jcr:name attribute would be created
>  as a node with name equal to the value of jcr:name, and with a node type
>  equal to the element's name. If the node type is not present, it could be
>  either created on the fly (as inheriting from nt:unstructured), or an
>  exception could be thrown. Similarly, if an element does not have a
>  jcr:name, or has a jcr:name identical to a sibling's, an exception could be
>  thrown, or a new id could be assigned silently.*
>  For export, a new method exportNormalView() could be added alongside the
>  existing exportDocumentView() and exportSystemView(); it would export a
>  materialized view of the virtual Normal View.
>  In this way, importing XML into the repository would not create SNS, the
>  element() function would be needed only when the type inheritance hierarchy
>  needs to be evaluated, and, most importantly, people would no longer be
>  confused about modeling "the XML way" vs "the JCR way".
>  Finally, the technical question. Is there a simple way to extend the XPATH
>  parser to handle this type of query? Or, has anybody had any experience
>  plugging Jaxen into the JCR? Everything else seems to be pretty
>  straightforward to implement, even if just to see how it behaves in
>  the real world.
>  Of course, any thought, even an utterly critical one, is welcome.
>  Alessandro
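
The import rule proposed in the quoted message can be sketched without a
repository at all. The following is only one possible reading of it, written
against the JDK's DOM API: each element becomes a planned node whose type is
the element name and whose name is the value of jcr:name. The class
NormalViewImportSketch and its plan() method are hypothetical, not part of
JCR or Jackrabbit. Two assumptions go beyond the proposal: an element without
jcr:name falls back to its element name (as my:name and my:dob do in the
example), and any resulting sibling name clash is rejected rather than
silently renamed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NormalViewImportSketch {
    // Namespace URI bound to the jcr prefix by JSR 170.
    static final String JCR_NS = "http://www.jcp.org/jcr/1.0";

    /** Lists "path [nodeType]" for every element below the document root. */
    static List<String> plan(String xml) {
        Element root;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to read jcr:name by namespace
            root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        List<String> planned = new ArrayList<>();
        walk(root, "/" + root.getNodeName(), planned);
        return planned;
    }

    private static void walk(Element parent, String parentPath, List<String> planned) {
        Set<String> siblingNames = new HashSet<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // text handling (jcr:xmltext) is left out of this sketch
            }
            Element element = (Element) n;
            String nodeType = element.getNodeName(); // element name becomes the node type
            // Assumption: without jcr:name, fall back to the element name.
            String nodeName = element.hasAttributeNS(JCR_NS, "name")
                    ? element.getAttributeNS(JCR_NS, "name")
                    : nodeType;
            if (!siblingNames.add(nodeName)) {
                // Strict variant: reject the clash instead of creating an SNS.
                throw new IllegalArgumentException(
                        "same name sibling '" + nodeName + "' under " + parentPath);
            }
            String path = parentPath + "/" + nodeName;
            planned.add(path + " [" + nodeType + "]");
            walk(element, path, planned);
        }
    }
}
```

Fed the Normal View document from the message (with the my and jcr prefixes
declared), plan() would yield entries such as "/people/john.smith [my:employee]",
while two my:employee siblings without distinct jcr:name values would be
rejected. A real importer would also have to check or register the node types,
which this sketch does not attempt.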
