jackrabbit-dev mailing list archives

From Dominik Süß <dominik.su...@gmail.com>
Subject Semantic distance search
Date Thu, 21 May 2009 13:16:46 GMT
Hi everybody,

after some time of indirect contact with JCR through Sling and Day
CRX/CQ, I now think it's time to get in touch with Jackrabbit directly. As
the subject says, I do this with an idea that I'd like to share and
need some help to realize (since my Lucene experience is close to nothing
beyond pure usage and theory). I did try to start with a proof of concept, but
when I looked at the current search implementations in JCR I had to realize I
need someone who could give me a jumpstart and take the first steps together
with me. So here I go with my idea:

I recently had some thoughts about something I'd call semantic distance in
multidimensional hierarchies (content structures + hierarchical tagging like
in CQ 5 [1]).

The task I would like to fulfill: Find the semantically closest nodes for a
given node.

I postulate that the content structure represents semantic relations, and that
the referenced tags likewise sit in a hierarchy that represents semantic
relations. Furthermore, I postulate that subnodes are semantically a subset of
the "type" of the parent node (not thinking of JCR types but of semantic
classifications). This leads to the following thesis: the distance to the
closest shared parent node represents the unidirectional distance from one node
to another. As a result, a whole branch has the same distance to a given node
(which should be correct, since a subnode in the branch belongs to the parent
node that connects the two branches we are looking at).
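
To make the thesis concrete, here is a minimal sketch in Python (not Jackrabbit
or Lucene code). The function names and the example handles are made up for
illustration, and I assume that a handle is a "/"-separated path and that a
node counts as its own closest shared parent when the other node lies below it.

```python
def ancestor_handles(handle):
    """Return the handle and all of its parent handles, deepest first, ending at "/"."""
    parts = [p for p in handle.split("/") if p]
    chain = ["/" + "/".join(parts[:i]) for i in range(len(parts), 0, -1)]
    return chain + ["/"]


def unidirectional_distance(source, target):
    """Count the steps from source up to the closest handle shared with target.

    Assumption: the shared handle may be the source itself (distance 0).
    """
    target_chain = set(ancestor_handles(target))
    for steps, handle in enumerate(ancestor_handles(source)):
        if handle in target_chain:
            return steps
    # Not reachable for well-formed handles: the root "/" is always shared.
    raise ValueError("no shared parent found")


# Hypothetical handles: everything below /content/en/products is equally far
# from /content/en/news, while the reverse direction gives a different value.
d_branch_top = unidirectional_distance("/content/en/news", "/content/en/products/a")
d_branch_deep = unidirectional_distance("/content/en/news", "/content/en/products/a/b")
d_reverse = unidirectional_distance("/content/en/products/a", "/content/en/news")
```

With these definitions the two branch distances come out equal, which is the
"whole branch has the same distance" property, and the reverse direction
differs, which is why I call the distance unidirectional.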

Figuring out a good way to produce an index for this really seems to
be hard, so I rethought my assumptions and came up with the following way of
determining the distance without indexing the explicit distance (I came up
with this after reading a bit about analyzers and stemming).

1. For indexing, all referenced tag handles and the node's own handle are
taken into account.
2. An analyzer produces string tokens out of each handle. Each handle is
split up into multiple handles by repeatedly removing the last node until the
root node is reached (so the node and every parent node are indexed for this
node, as well as for each referenced tag).
3. The query has to be built from a given handle, since I want to search for
the semantically closest nodes.
4. The query is built the same way the analyzer works: the handle is split
into all of its parent handles.
Result: A 100% match can only be produced for the same node (for all other
nodes at least the node's own handle is missing). The "semantically"
closer a node is, the more handles will match, which will result in the
ordering I intended. Et voilà, we have all we need to search for
semantically close pages in a proper sort order.
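
The four steps can be sketched in plain Python as follows. This is only a toy
model of the idea, not an analyzer or query for Lucene: the in-memory dict
stands in for the index, the score is a simple count of matching handles
(real Lucene scoring would weight terms differently), and all names, handles
and tags are hypothetical. I also assume here that the root handle itself is
not indexed, and that the query includes the tags of the given node.

```python
def expand_handle(handle):
    """Split a handle into itself plus all parent handles (root "/" left out)."""
    parts = [p for p in handle.split("/") if p]
    return {"/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)}


def index_tokens(handle, tag_handles=()):
    """Steps 1 and 2: tokens for a node from its own handle and its tag handles."""
    tokens = set(expand_handle(handle))
    for tag in tag_handles:
        tokens |= expand_handle(tag)
    return tokens


def rank_closest(query_handle, query_tags, index):
    """Steps 3 and 4: expand the query like the index, then order by overlap."""
    query = index_tokens(query_handle, query_tags)
    scored = []
    for handle, tokens in index.items():
        if handle == query_handle:
            continue  # the node itself would be the only 100% match
        overlap = len(query & tokens)
        if overlap:
            scored.append((handle, overlap))
    # Most shared handles first; ties are broken by handle for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Hypothetical content and tag handles, for illustration only.
football = "/etc/tags/sport/football"
tennis = "/etc/tags/sport/tennis"
index = {
    "/content/en/news/a": index_tokens("/content/en/news/a", [football]),
    "/content/en/news/b": index_tokens("/content/en/news/b", [football]),
    "/content/en/products/x": index_tokens("/content/en/products/x", [tennis]),
    "/content/de/home": index_tokens("/content/de/home"),
}
ranking = rank_closest("/content/en/news/a", [football], index)
```

In this example the sibling page with the same tag ranks first, the page in
another branch with a related tag comes next, and the untagged page in a
different language tree comes last.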

I might have a gap in my conclusions that I haven't noticed yet. I'd love to
have some feedback and would appreciate some help getting started with the
mentioned proof of concept.

WDYT?

Best regards,
Dominik

[1] http://dev.day.com/microsling/content/blogs/main/cq5tags.html
