jackrabbit-users mailing list archives

From "Stefan Kurla" <stefan.ku...@gmail.com>
Subject Re: importing jackrabbit into jackrabbit
Date Fri, 27 Apr 2007 17:57:39 GMT
Hi David,

Thank you for your post and explanation. Very nicely put.

It helps, and yes, this does rub me the wrong way, not because I believe
in "structure first" but because I believe that all we are doing here
is a workaround that does not address the issue. Frankly, I am surprised
that this issue has not moved up the list of priority items to
fix.

> > I think that the main problem is not really about the specific case,
> > but in general that when people design relational databases, they
> > always use references (or more properly, joins) to define data that
> > belongs logically to many entities, but should not be duplicated.
> I completely agree with your statement.
> And I think this is one of the biggest challenges that we are
> going to face.
> People are thinking within the facilities provided by a relational
> database and within the data modeling practices that they have
> been using for decades now. Which is very understandable.
> A content repository offers much richer facilities for content modelling
> primarily through features like a hierarchy, multi-value properties or
> even features like sorted children which in an RDBMS world have
> to be modeled by the application developer.
>
> > Imagine that you have a company tree, with "positions",
> > "departments", "employees", "health plans" etc.
> > An employee could belong to a department, have a position and a
> > health plan, but typically you would not make all those nodes child
> > nodes of the employee: you would instead define references to the
> > proper node in the "position" and "health plan" subtrees.
> I think one-to-many relationships should be modeled as a hierarchy.
> So my initial gut feeling would be a datamodel like this:
> /bigco
> /bigco/marketingdept
> /bigco/marketingdept/joeshmoe
>
> and "joeshmoe" would be of nodetype
>
> [bigco:employee]
> - position
> - healthplan
>
> Now "position", "healthplan" are many-to-many relationships.
> I think that those can either be modeled as references, paths,
> names or strings.
> People that come from a "hard structured" RDBMS background
> very often think that a reference is the only option.
>
> For example "position" might very well be a "string" or a "name"
> if the application can deal with the fact that information is "dangling".
>
> If we continue to model the above tree with...
>
> /bigco/positions/
> /bigco/positions/secretary
> /bigco/positions/svp
>
> ... I think I would personally choose to store a human-readable
> "string" property that is actually the name of the target node in
> /bigco/positions.
> So I would store "svp" or "secretary" in the position property.
>
> Since I would not use namespaces for the names of the children
> in "positions" I would not need the overhead of a true name property in
> my employee node.

This is a good workaround, and an excellent example.
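
A minimal sketch of that string-valued "position" idea, in plain Java with an in-memory map standing in for the children of /bigco/positions. This is not the JCR API; the class name, the method names and the job titles are made up for illustration. It only shows that a dangling string is still readable:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Plain-Java stand-in for the repository; NOT the JCR API (hypothetical names).
class PositionLookup {
    // Children of /bigco/positions, keyed by node name. Titles are invented.
    private final Map<String, String> positions = new LinkedHashMap<>();

    void addPosition(String name, String title) {
        positions.put(name, title);
    }

    void removePosition(String name) {
        positions.remove(name);
    }

    // Resolve the string stored in an employee's "position" property.
    // An empty result means the value dangles: no child of that name exists.
    Optional<String> resolve(String positionProperty) {
        return Optional.ofNullable(positions.get(positionProperty));
    }

    // Text to show a user: the resolved title, or the raw string if dangling.
    String display(String positionProperty) {
        return resolve(positionProperty).orElse(positionProperty);
    }

    public static void main(String[] args) {
        PositionLookup repo = new PositionLookup();
        repo.addPosition("secretary", "Secretary");
        repo.addPosition("svp", "Senior Vice President");

        System.out.println(repo.display("svp")); // resolved title
        repo.removePosition("svp");
        System.out.println(repo.display("svp")); // dangling, prints the raw "svp"
    }
}
```

The application, not the repository, decides what a missing target means here, which is exactly the trade-off being discussed.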

>
> While this probably rubs a lot of "structure first" people the wrong
> way I prefer this model since the information carried in the
> string "secretary" is still valuable even if it is "dangling".
> (...opposed to some UUID)
>
> There certainly are use cases where referential integrity is very
> important, but it is important to understand that it comes at a price:
> it costs performance and, even more importantly, it constrains the
> flexibility of your applications from a "data-first" perspective.
>
> > What could be the right way to model things? Maybe using a "path"
> > property to point to the node instead? Of course, it would not be as
> > easy to use as a reference, and it would require global updates
> > if the target node ever changes position, but I can't see other options.
> If you would like to protect against "move" operations but want to avoid
> the overhead of referential integrity, you can store the UUID of the target
> in a string property. In JSR-283 we are looking at a "weak-reference" to
> express a reference that can dangle in a more formal way.
>
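
A rough sketch of the "UUID in a string property" idea from the paragraph above, again in plain Java rather than the JCR API. The class, its methods and the /bigco/roles path are hypothetical. The stored identifier survives a move of the target, and simply fails to resolve once the target is gone:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// In-memory stand-in for a repository; NOT the JCR API (hypothetical names).
class WeakRefSketch {
    // Current path of every node, keyed by its identifier.
    private final Map<String, String> pathByUuid = new HashMap<>();

    // Create a node and return the identifier a referrer would store as a string.
    String addNode(String path) {
        String id = UUID.randomUUID().toString();
        pathByUuid.put(id, path);
        return id;
    }

    // Moving a node changes its path but not its identifier.
    void move(String id, String newPath) {
        pathByUuid.replace(id, newPath);
    }

    // Nothing stops removal: referrers are not tracked, so they may dangle.
    void remove(String id) {
        pathByUuid.remove(id);
    }

    // Look up the stored string; empty means the reference dangles.
    Optional<String> resolve(String storedUuid) {
        return Optional.ofNullable(pathByUuid.get(storedUuid));
    }

    public static void main(String[] args) {
        WeakRefSketch repo = new WeakRefSketch();
        String stored = repo.addNode("/bigco/positions/svp");

        repo.move(stored, "/bigco/roles/svp");   // hypothetical new location
        System.out.println(repo.resolve(stored)); // still resolves, new path

        repo.remove(stored);
        System.out.println(repo.resolve(stored).isPresent()); // false
    }
}
```
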
> > It's easy to see how, in a large company, there could be thousands of
> > employees holding the same position and health plan, and those
> > specific nodes ("Secretary" and "Plan A") would have thousands of
> > references pointing to them.
> > So, given the issue as explained by Marcel that "whenever a
> > reference is added that points to a node N the complete set of
> > references pointing to N is re-written to the persistence manager",
> > it seems that using references to a node that is very "popular" is
> > really going to be creating problems in the long term.
> Agreed. And I think we will not be able to re-educate everybody with
> an RDBMS background before using Jackrabbit so I think Jackrabbit has
> to be able to deal with very large quantities of references in a very
> efficient way.
> So I would recommend to fix that as noted by Tom in the last sentence of:
> http://issues.apache.org/jira/browse/JCR-657
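
To put a number on the behaviour Marcel describes, here is a small counting model in plain Java. It is not Jackrabbit code: the class and method names are invented, and it assumes only what the quote says, namely that each added reference re-writes the complete set of references to the target:

```java
import java.util.ArrayList;
import java.util.List;

// Counting model only; NOT Jackrabbit's persistence code (hypothetical names).
class ReferenceRewriteCost {
    // Add n references to one target, one at a time. After each add the
    // whole reference set is "written", so we count its current size.
    static long entriesWritten(int n) {
        List<Integer> referrers = new ArrayList<>();
        long written = 0;
        for (int i = 0; i < n; i++) {
            referrers.add(i);            // new reference to the popular node
            written += referrers.size(); // complete set is re-written
        }
        return written;
    }

    public static void main(String[] args) {
        // 1 + 2 + ... + 1000 = 1000 * 1001 / 2
        System.out.println(entriesWritten(1000)); // prints 500500
    }
}
```

Under that assumption the total work grows with the square of the number of referrers, which is why a "popular" node such as "Secretary" becomes expensive.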


"re-educate" - I do not think that this has anything to do with RDBMS;
this is basic filing and bookkeeping procedure. Say you are the
CEO of the company and you are referenced in multiple contracts and
legal proceedings that the company is involved in. Would you not want
the contracts being filed to keep a reference to the master file as a
matter of upkeep?

+1 on fixing this problem.
