Hi everybody,

After some time of indirect contact with JCR through Sling and Day CRX/CQ, I think it's time to get in touch with Jackrabbit directly. As the subject says, I am doing this because I have an idea that I'd like to share and need some help to realize (my Lucene experience is close to nothing beyond pure usage and theory). I tried to start a proof of concept, but when I looked at the current search implementations in JCR I realized I need someone who could give me a jump start and take the first steps together with me. So here is my idea:

I recently had some thoughts about something I'd call semantic distance in multidimensional hierarchies (content structures plus hierarchical tagging as in CQ 5 [1]).

The task I would like to fulfill: find the semantically closest nodes for a given node.

I postulate that the content structure represents semantic relations, and that the referenced tags likewise sit in a hierarchy that represents semantic relations.
Furthermore, I postulate that subnodes are semantically a subset of the "type" of their parent node (not in the sense of JCR node types, but of semantic classification).
This leads to the following thesis: the distance from a node up to the closest parent node it shares with another node represents the unidirectional distance from that node to the other one.
As a result, a whole branch has the same distance to a given node (which should be correct, since a subnode in that branch belongs to the parent node that connects the two branches we are looking at).
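
To make the thesis a bit more concrete, here is a minimal Python sketch of the distance as described above. It is only an illustration, not Jackrabbit code: the function names and the example handles are made up, and treating the root node as a parent shared by all nodes is my assumption.

```python
def ancestors(handle):
    """Return the handle itself and all parent handles, deepest first.

    The root "/" is appended last (assumption: every node shares the root).
    """
    parts = [p for p in handle.split("/") if p]
    chain = ["/" + "/".join(parts[:i]) for i in range(len(parts), 0, -1)]
    return chain + ["/"]


def distance(source, target):
    """Number of steps from source up to the closest handle shared with target.

    The result is unidirectional: distance(a, b) may differ from distance(b, a).
    """
    target_chain = set(ancestors(target))
    for steps, handle in enumerate(ancestors(source)):
        if handle in target_chain:
            return steps


# Hypothetical handles, for illustration only.
a = "/content/site/en/products/a"
print(distance(a, "/content/site/en/news/x"))    # 2
print(distance(a, "/content/site/en/news/x/y"))  # 2, same branch gives same distance
print(distance("/content/site/en/news/x/y", a))  # 3, the reverse direction differs
```

Every node below /content/site/en/news gets the same distance to the products node, which is the "whole branch" effect, and the reverse direction gives a different value, which is why I call the distance unidirectional.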

Figuring out a good way to build an index for this seems really hard, so I rethought my assumptions and came up with the following way of determining the distance without indexing the explicit distance (this occurred to me after reading a bit about analyzers and stemming).

1. For each node, all referenced tag handles and the node's own handle are taken into account for indexing.
2. An analyzer produces string tokens from each handle. Each handle is split into multiple handles by removing the last node until the root node is reached (so the node and every parent node are indexed for this node, and likewise for each referenced tag).
3. The query has to be built from a given handle, since I want to search for the semantically closest nodes.
4. The query is built the same way the analyzer works: the handle is split into all of its parent handles.
Result: A 100% match can only be produced for the same node (for all other nodes at least the node's own handle is missing). The "semantically" closer a node is, the more handles will match, which results in the ordering I intended. Et voilà, we have all we need to search for semantically close pages in a proper sort order.
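
The four steps can be sketched roughly like this in plain Python. This is not Lucene or Jackrabbit code, just the idea with an in-memory dictionary standing in for the index. All names and example handles are hypothetical. I assume here that the query also includes the tag handles of the given node, that the root handle "/" itself is not emitted as a token, and that ranking is a plain count of matching tokens (a real Lucene score would weight terms differently).

```python
def handle_tokens(handle):
    """Split a handle into itself and all parent handles (root "/" excluded)."""
    parts = [p for p in handle.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(len(parts), 0, -1)]


def node_tokens(handle, tag_handles):
    """Steps 1 and 2: tokens for the node's own handle and every referenced tag."""
    tokens = set(handle_tokens(handle))
    for tag in tag_handles:
        tokens.update(handle_tokens(tag))
    return tokens


def build_index(nodes):
    """nodes maps a node handle to its list of tag handles."""
    return {handle: node_tokens(handle, tags) for handle, tags in nodes.items()}


def closest(index, handle):
    """Steps 3 and 4: query with the node's own tokens, rank by match count."""
    query = index[handle]
    scored = [
        (other, len(query & tokens))
        for other, tokens in index.items()
        if other != handle
    ]
    scored = [(other, matches) for other, matches in scored if matches > 0]
    # Most matches first; ties are ordered by handle to keep the output stable.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical content and tag handles, for illustration only.
nodes = {
    "/content/site/en/products/bike": ["/etc/tags/sport/cycling"],
    "/content/site/en/products/helmet": ["/etc/tags/sport/cycling", "/etc/tags/safety"],
    "/content/site/en/news/tour": ["/etc/tags/sport/cycling"],
    "/content/site/de/about": ["/etc/tags/company"],
}
index = build_index(nodes)
for other, matches in closest(index, "/content/site/en/products/bike"):
    print(matches, other)
# 8 /content/site/en/products/helmet
# 7 /content/site/en/news/tour
# 4 /content/site/de/about
```

The bike node itself would match all 9 of its own tokens, so no other node can reach a 100% match, and the sibling with the same tag ranks ahead of the node that only shares the tag.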

I might have a gap in my conclusions that I haven't noticed yet. I'd love to have some feedback and would appreciate some help getting started with the proof of concept.


Best regards,

[1] http://dev.day.com/microsling/content/blogs/main/cq5tags.html