jackrabbit-users mailing list archives

From "David Nuescheler" <david.nuesche...@gmail.com>
Subject Re: importing jackrabbit into jackrabbit
Date Fri, 27 Apr 2007 07:37:23 GMT
Hi Alessandro,

thanks a lot for your thoughtful mail.
I think you hit the nail right on the head.

> I think that the main problem is not really about the specific case,
> but in general that when people design relational databases, they
> always use references (or more properly, joins) to define data that
> belongs logically to many entities, but should not be duplicated.
I completely agree with your statement.
And I think this is one of the biggest challenges that we are
going to face.
People are thinking within the facilities provided by a relational
database and within the data modeling practices that they have
been using for decades now. Which is very understandable.
A content repository offers much richer facilities for content modelling
primarily through features like a hierarchy, multi-value properties or
even features like sorted children which in an RDBMS world have
to be modeled by the application developer.

> Imagine that you have a company tree, with "positions",
> "departments", "employees", "health plans" etc.
> An employee could belong to a department, have a position and a
> health plan, but typically you would not make all those nodes child
> nodes of the employee: you would instead define references to the
> proper node in the "position" and "health plan" subtrees.
I think one-to-many relationships should be modeled as a hierarchy.
So my initial gut feeling would be a data model like this:
/bigco
/bigco/marketingdept
/bigco/marketingdept/joeshmoe

and "joeshmoe" would be of nodetype

[bigco:employee]
- position
- healthplan

Now "position" and "healthplan" are many-to-many relationships.
I think that those can be modeled as references, paths,
names or strings.
People that come from a "hard structured" RDBMS background
very often think that a reference is the only option.

For example "position" might very well be a "string" or a "name"
if the application can deal with the fact that information is "dangling".

If we continue to model the above tree with...

/bigco/positions/
/bigco/positions/secretary
/bigco/positions/svp

... I think I would personally choose to store a human-readable
"string" property that is actually the name of the target node in
/bigco/positions.
So I would store "svp" or "secretary" in the position property.

Since I would not use namespaces for the names of the children
in "positions", I would not need the overhead of a true name property in
my employee node.

While this probably rubs a lot of "structure first" people the wrong
way, I prefer this model since the information carried in the
string "secretary" is still valuable even if it is "dangling"
(...as opposed to some UUID).
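
A minimal sketch of this idea, written in plain Java with an in-memory map
standing in for the repository (it does not use the JCR API). The class name
PositionLookup, its methods and the title "Senior Vice President" are
assumptions made up for illustration. It shows that a string-valued
"position" still yields something useful when no matching node exists
under /bigco/positions:

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

public class PositionLookup {
    // Stand-in for the repository: absolute node path -> display title.
    private final Map<String, String> nodes = new TreeMap<>();

    public void addNode(String path, String title) {
        nodes.put(path, title);
    }

    // Resolve the string stored in an employee's "position" property
    // against /bigco/positions. Empty if the target does not exist.
    public Optional<String> resolve(String positionName) {
        return Optional.ofNullable(nodes.get("/bigco/positions/" + positionName));
    }

    // Label to show for an employee: the target's title if it resolves,
    // otherwise the raw string, which is still human readable.
    public String label(String positionName) {
        return resolve(positionName).orElse(positionName);
    }

    public static void main(String[] args) {
        PositionLookup repo = new PositionLookup();
        repo.addNode("/bigco/positions/svp", "Senior Vice President");
        System.out.println(repo.label("svp"));        // target exists
        System.out.println(repo.label("secretary"));  // dangling, falls back to the string
    }
}
```

With a UUID in place of "secretary", the fallback branch would have nothing
meaningful to display.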

I think it is important to understand that there certainly are use cases
where referential integrity is very important, but also that it comes
at a price: it costs performance and, even more importantly, it constrains
the flexibility of your applications from a "data-first" perspective.

> What could be the right way to model things? Maybe using a "path"
> property to point to the node instead? Of course, it would not be as
> easy to use as a reference, and it would be requiring global updates
> if the pointed node ever changes position, but I can't see other options.
If you would like to protect against "move" operations but want to avoid
the overhead of referential integrity, you can store the UUID of the target
in a string property. In JSR-283 we are looking at a "weak reference" to
express, in a more formal way, a reference that can dangle.
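
The behaviour of such a UUID-in-a-string link can be sketched in plain Java,
again with an in-memory map instead of the JCR API. The class WeakLinkSketch,
its methods and the paths used in main are hypothetical names chosen for
illustration. The stored UUID keeps resolving after a move, and it simply
fails to resolve once the target is removed, since nothing enforces
integrity:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

public class WeakLinkSketch {
    // Stand-in for the repository: node UUID -> current path.
    private final Map<String, String> pathByUuid = new HashMap<>();

    // Create a node and return the UUID a caller would store as a plain string.
    public String add(String path) {
        String id = UUID.randomUUID().toString();
        pathByUuid.put(id, path);
        return id;
    }

    // Moving a node changes its path but not its UUID.
    public void move(String id, String newPath) {
        pathByUuid.replace(id, newPath);
    }

    // Removal is never blocked: stored UUID strings are not checked.
    public void remove(String id) {
        pathByUuid.remove(id);
    }

    // Follow a stored UUID string. Empty means the link is dangling.
    public Optional<String> lookup(String storedUuid) {
        return Optional.ofNullable(pathByUuid.get(storedUuid));
    }

    public static void main(String[] args) {
        WeakLinkSketch repo = new WeakLinkSketch();
        String healthplan = repo.add("/bigco/healthplans/plan-a");
        repo.move(healthplan, "/bigco/benefits/plan-a");
        System.out.println(repo.lookup(healthplan)); // still resolves after the move
        repo.remove(healthplan);
        System.out.println(repo.lookup(healthplan)); // now dangling
    }
}
```

The application has to handle the empty case itself, which is the price of
skipping referential integrity.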

> It's easy to see how, in a large company, there could be thousands of
> employees holding the same position and health plan, and those
> specific nodes ("Secretary" and "Plan A") would have thousands of
> references pointing to them.
> So, given the issue as explained by Marcel that "whenever a
> reference is added that points to a node N the complete set of
> references pointing to N is re-written to the persistence manager",
> it seems that using references to a node that is very "popular" is
> really going to be creating problems in the long term.
Agreed. And I think we will not be able to re-educate everybody with
an RDBMS background before they use Jackrabbit, so I think Jackrabbit has
to be able to deal with very large quantities of references in a very
efficient way.
So I would recommend fixing that, as noted by Tom in the last sentence of:
http://issues.apache.org/jira/browse/JCR-657
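
To put a number on the behaviour Marcel describes, here is a small
back-of-the-envelope sketch in plain Java. It only counts reference entries
written and is not Jackrabbit code. The appendOnly figure is a hypothetical
baseline for comparison, not a description of what JCR-657 proposes:

```java
public class ReferenceRewriteCost {
    // Entries written when the complete reference set of the target is
    // rewritten each time one more reference is added: 1 + 2 + ... + n.
    public static long rewriteAll(int n) {
        long written = 0;
        for (int size = 1; size <= n; size++) {
            written += size;
        }
        return written;
    }

    // Hypothetical baseline: each added reference writes a single entry.
    public static long appendOnly(int n) {
        return n;
    }

    public static void main(String[] args) {
        int employees = 1000; // e.g. a thousand employees pointing at "Secretary"
        System.out.println(rewriteAll(employees));  // 500500
        System.out.println(appendOnly(employees));  // 1000
    }
}
```

The rewrite-everything cost grows quadratically with the number of
references to one "popular" node, which is why it hurts in the long term.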

regards,
david
