jackrabbit-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From S├ębastien Launay <sebastien.lau...@anyware-tech.com>
Subject Efficient but simple hash for storing lot of unidentifiable contents
Date Wed, 07 Jan 2009 13:25:41 GMT
Hi,

I have several applications where contents does not have ids or
immutable names when storing them into JCR, therefore we stored
contents below a common root node in a flatten style with same
name sibling feature.

When we have the same type of content in a application we use a custom
hash for cutting the id/name with the better balancing possible (i.e.
for content identified by a lastname we use the last 3 characters,
ex: smith => h/t/i/smith). 

Because we can have different type of content and ids are not available
or just different, i need to implement an efficient but simple hash
(no complex balanced tree) for storing these nodes and not having
problem with Jackrabbit design (max 10k subnodes).

A colleague of mine found an algorithm using a counter for creating
a generic hash:
- start a counter from 0
- increment for each new content in an atomic fashion
- convert the counter value to a string with zero-padded and a length
  multiple of 3
- reverse the string (for not using the same parent node and better
  balancing)
- cut the string into 3 length segments
- create or use an existing node holder with each segment
- the node of the content will be stored below the last node holder

Clearly, if we create 100 nodes (and use 2 length segment) we would have:
root/00/ (node type: contentHolder)
root/00/entry (node type: myContent)
root/10/ (node type: contentHolder)
root/10/entry (node type: myContent)
...
root/90/ (node type: contentHolder)
root/90/content (node type: myContent)
root/01/ (node type: contentHolder)
root/01/content (node type: myContent)
...
...
root/99/ (node type: contentHolder)
root/99/content (node type: myContent)

If we create 2 more nodes (#100 => 0100 => '00'10' and
                           #101 => 0101 => '10'10') we would have:
root/00/ (node type: contentHolder)
root/00/entry (node type: myContent)
root/00/10 (node type: contentHolder)
root/00/10/entry (node type: myContent)
root/10/ (node type: contentHolder)
root/10/entry (node type: myContent)
root/10/00 (node type: contentHolder)
root/10/00/entry (node type: myContent)
...
root/90/ (node type: contentHolder)
root/90/content (node type: myContent)
root/01/ (node type: contentHolder)
root/01/content (node type: myContent)
...
...
root/99/ (node type: contentHolder)
root/99/content (node type: myContent)

Note that a different parent node is used for storing these 2 new nodes.
The more contents are created, the deeper they are stored.
With 3 length segment, there is maximum of 1000+1 child nodes per node.

This works well but:
- storing the counter in the repository with a synchronized does not work
  in clustering, there is no way to do an atomic increment therefore we use
  a database in this case
- if we delete a content this create holes in the tree (holder without entry)
- we can manage to store unused holder for reusing purpose but we need to store
  them in a efficient and concurrently safe manner (like svnmerge properties)

So, here is my solution but i am very interested if you have the same kind of
problem and use a better algorithm.

--
S├ębastien Launay


Mime
View raw message