After some time of indirect contact with JCR through Sling and Day CRX/CQ, I think it's time to get in touch with Jackrabbit directly. As the subject says, I'm doing this because I have an idea that I'd like to share and need some help to realize (my Lucene experience is close to nothing beyond pure usage and theory). I tried to start a proof of concept, but when I looked at the current search implementations in JCR I realized I need someone who could give me a jump start and take the first steps with me. So here is my idea:
I recently had some thoughts about something I'd call semantic distance
in multidimensional hierarchies (content structures + hierarchical
tagging like in CQ 5).
The task I would like to fulfill: find the semantically closest nodes for a given node.
I postulate that the structure represents the semantic relation, just
as the referenced tags are in a hierarchy that represents semantic
Furthermore, I postulate that subnodes are semantically a subset
of the "type" of the parent node (not thinking of JCR types but in
This leads to the following thesis: the distance to the closest
shared parent node represents the unidirectional distance of a node to
The result is that a whole branch has the same distance to a node (which should be correct, since a subnode in the branch belongs to the parent node that connects the branches we have to look at).
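To make the thesis a bit more concrete, here is a minimal Java sketch of how I read it. It assumes handles are absolute, slash-separated paths and that the distance is measured from one node to another node; the class name SharedParentDistance and its methods are made up for illustration and are not part of Jackrabbit or JCR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not a Jackrabbit class.
final class SharedParentDistance {

    // Splits "/content/en/page" into [content, en, page].
    static List<String> segments(String handle) {
        return Arrays.stream(handle.split("/"))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // Number of steps from 'from' up to the closest parent it shares with 'to'.
    // Unidirectional: distance(a, b) may differ from distance(b, a).
    static int distance(String from, String to) {
        List<String> a = segments(from);
        List<String> b = segments(to);
        int shared = 0;
        while (shared < a.size() && shared < b.size()
                && a.get(shared).equals(b.get(shared))) {
            shared++;
        }
        return a.size() - shared;
    }
}
```

With this reading, distance("/content/en/products/a", "/content/en/news/x") and distance("/content/en/products/a", "/content/en/news/x/y") are the same value, which is the "whole branch has the same distance" effect described above.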
Figuring out a good way to produce an index for this seems to be really hard,
so I rethought my assumptions and came up with the following way of
determining the distance without indexing the explicit distance (came
up with this thought after reading a bit about the Analyzers and
1. For indexing, the node's own handle and all referenced tag handles are taken into account.
2. An analyzer produces string tokens out of each handle. Each handle is
split up into multiple handles by removing the last node until the
root node is reached (so the node and every parent node is indexed for
this node, as well as for each referenced tag).
3. The query has to be built from a given handle, since I want to search for the semantically closest nodes.
4. The query is built the same way the analyzer works: the handle is split into all of its parent handles.
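The splitting described in the steps above could look roughly like the following plain Java sketch. It does not use the Lucene Analyzer API; HandleExpander, expand and tokensFor are hypothetical names, and I assume absolute handles without a trailing slash. Whether the root node itself should become a token is an open choice: it is included here, but since every node shares it, it cannot influence the ordering.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a Lucene or Jackrabbit class.
final class HandleExpander {

    // "/content/en/page" -> [/content/en/page, /content/en, /content, /]
    static List<String> expand(String handle) {
        List<String> tokens = new ArrayList<>();
        String current = handle;
        while (current.length() > 1) {
            tokens.add(current);
            int cut = current.lastIndexOf('/');
            // Remove the last node; fall back to the root once nothing is left.
            current = cut <= 0 ? "/" : current.substring(0, cut);
        }
        tokens.add("/");
        return tokens;
    }

    // Tokens for one node: its own handle plus every referenced tag handle,
    // each expanded into all parent handles.
    static Set<String> tokensFor(String nodeHandle, List<String> tagHandles) {
        Set<String> tokens = new LinkedHashSet<>(expand(nodeHandle));
        for (String tag : tagHandles) {
            tokens.addAll(expand(tag));
        }
        return tokens;
    }
}
```

The same tokensFor call would be used twice: once per node at indexing time, and once for the given handle when building the query.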
A 100% match can only be produced for the same node (for all other
nodes at least the node's own handle is missing). The
"semantically" closer a node is, the more handles will match, which will
result in the ordering I intended. Et voilà, we have all we need to
search for semantically close pages in a proper sort order.
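To check the ordering argument, here is a small self-contained Java sketch that ranks nodes by the number of matching handles. It is only a plain overlap count, not Lucene's actual scoring (which weighs terms differently), and all names (ClosenessRanking, expand, tokens, score, rank) as well as the tagsByNode map are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Lucene or Jackrabbit class.
final class ClosenessRanking {

    // A handle plus all of its parent handles, up to and including the root.
    static List<String> expand(String handle) {
        List<String> out = new ArrayList<>();
        String current = handle;
        while (current.length() > 1) {
            out.add(current);
            int cut = current.lastIndexOf('/');
            current = cut <= 0 ? "/" : current.substring(0, cut);
        }
        out.add("/");
        return out;
    }

    // All tokens of a node: own handle and tag handles, each expanded.
    static Set<String> tokens(String handle, List<String> tags) {
        Set<String> out = new HashSet<>(expand(handle));
        for (String tag : tags) {
            out.addAll(expand(tag));
        }
        return out;
    }

    // Number of query tokens that the candidate also carries.
    static int score(Set<String> query, Set<String> candidate) {
        int hits = 0;
        for (String token : query) {
            if (candidate.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    // Nodes sorted by descending overlap with the query node; ties by handle.
    static List<String> rank(String queryNode, Map<String, List<String>> tagsByNode) {
        Set<String> query = tokens(queryNode, tagsByNode.getOrDefault(queryNode, List.of()));
        List<String> nodes = new ArrayList<>(tagsByNode.keySet());
        nodes.sort(Comparator
                .comparingInt((String n) -> score(query, tokens(n, tagsByNode.get(n))))
                .reversed()
                .thenComparing(Comparator.naturalOrder()));
        return nodes;
    }
}
```

In this sketch only the query node itself reaches the full score, a sibling with the same tag comes next, and a node in a distant branch with an unrelated tag ends up last.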
I might have a gap in my conclusions that I haven't noticed yet. I'd love to have some feedback and would appreciate some help getting started with the mentioned proof of concept.