[Sorry for crossposting, but I think several people might be interested
in knowing this and not many of them are subscribed to lucene-dev.
Please make sure you hit the reply-all button when replying, or the
discussion will be split.]
Doug Cutting wrote:
>
> > From: Stefano Mazzocchi [mailto:stefano@apache.org]
> >
> > Anyway, a possible solution would be to add the ability to attach a
> > 'boost factor' to each token so that the Scorer can rate hits
> > based on this information (the search phase would not be influenced
> > by these boost factors).
>
> A simple approach is to add emphasized terms to a separate field, and always
> search for terms in both the normal field and the emphasized field. Because
> the emphasized field is shorter, matches in it boost scores more than those
> in the normal field, in the same way that "title" matches are stronger than
> "body" matches.
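[An aside on why the two-field trick works: classic Lucene's
DefaultSimilarity normalizes a field's score contribution by 1/sqrt of
its length in tokens, so a hit in the short emphasized field outweighs
the same hit in the long body field. A minimal sketch of that
arithmetic (the field lengths are made-up numbers, and termScore is a
simplified stand-in for the real scoring formula):]

```java
public class FieldBoostSketch {
    // Simplified per-hit score: tf scaled by classic Lucene's 1/sqrt(length) norm.
    public static double termScore(int termFreq, int fieldLength) {
        return termFreq / Math.sqrt(fieldLength);
    }

    public static void main(String[] args) {
        double bodyHit = termScore(1, 200);     // term found once in a 200-token body field
        double emphasizedHit = termScore(1, 8); // same term in an 8-token emphasized field
        // The shorter field yields the larger contribution, hence the implicit boost.
        System.out.println(emphasizedHit > bodyHit); // prints "true"
    }
}
```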
>
> I made a proposal a while back which could also be used to achieve this. It
> is not the most elegant solution, but a solution nonetheless.
[snip]
I see, but my previous example was just the tip of the iceberg.
Please, consider the following XML document:

  <sdlfkl>I <sdflsdlkfj>like <duidfkj>Klingon</duidfkj> semantic</sdflsdlkfj> tags.</sdlfkl>
One indexing solution would be to ignore those tags altogether and
index only the included text. This means losing all the semantic
content that might be associated with those tags.
Another solution is to add a different field for each element/attribute
in its own namespace. This associates the text with its semantic
context. No information is lost, but the search requires the user to
identify the text in a specific context, and this is normally not
feasible/useful/tolerable.
Let us analyze this from the linear algebra point of view. Consider a
new vector space where each document is a matrix of n times m elements.
                            | i | like | klingon | semantic | tags |
  --------------------------+---+------+---------+----------+------+
  sdlfkl                    |   |      |         |          |      |
  sdlfkl/sdflsdlkfj         |   |      |         |          |      |
  sdlfkl/sdflsdlkfj/duidfkj |   |      |         |          |      |
where each element e(i,j) is a function of the "relevance" of the term
in that particular context.
The most obvious solution is to keep the vector space as it is and
evaluate document distance via the scalar product of the document (this
matrix) and the query (another matrix).
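To make the scalar product concrete: treating both the document and the
query as term-by-context matrices, the distance above is the
element-wise (Frobenius) inner product. A small self-contained sketch
(the 2x2 matrices are invented toy values):

```java
public class MatrixDot {
    // Element-wise (Frobenius) inner product of two equally sized matrices.
    public static double dot(double[][] a, double[][] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[i].length; j++)
                sum += a[i][j] * b[i][j];
        return sum;
    }

    public static void main(String[] args) {
        double[][] doc   = {{1.0, 0.5}, {0.0, 2.0}}; // relevance of 2 terms in 2 contexts
        double[][] query = {{1.0, 0.0}, {0.0, 1.0}}; // fully contextualized query
        System.out.println(dot(doc, query)); // prints "3.0"
    }
}
```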
The problem is that user queries are normally very small and rarely
contextualized (also because users don't know what contexts are
available, nor is it possible to provide a complete list of those
contexts, just as you don't provide the list of indexed terms).
So a fully contextualized query would be very effective, but would
require information that is generally not available to the user
(consider something for general users, not for experts).
A better solution would "project" this n*m-dimensional space into an
n-dimensional space.
The advantage of this is that users can perform queries without
indicating information on the context the terms are found in.
At the same time, this "projection" must be done in such a way that the
context information is not "wasted".
In mathematical terms, this projection is a function p: n*m -> n that
'collapses' the other m-dimensions (those of the markup contexts) into
the remaining n (those of the original terms).
It could be seen as a geometrical way to 'enhance' the relevance
information for each term, using the contextual information.
The key point is the projecting function, so let's see what we can come
up with:
1) additive projection is the easiest: each column of the matrix is
summed. So

         m
        ---
   V  = \    M          for each i in [1,n]
    i   /     i,j
        ---
       j = 1

The result is that contextual information is not taken into
consideration and is thus totally wasted. Given the time/energy/money
resources invested in creating such semantically-marked-up content,
wasting it completely is an extremely poor way of indexing such
content.
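For reference, the additive projection is just a per-term sum over the
context dimension; a sketch (the matrix values are invented):

```java
public class AdditiveProjectionSketch {
    // V_i = sum over j of M[i][j]: collapse the context dimension by plain summation.
    public static double[] project(double[][] m) {
        double[] v = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[i].length; j++)
                v[i] += m[i][j];
        return v;
    }

    public static void main(String[] args) {
        // 2 terms x 3 contexts; every context counts the same, so any
        // semantic distinction between contexts is lost.
        double[][] m = {{1.0, 2.0, 3.0}, {0.0, 1.0, 0.0}};
        double[] v = project(m);
        System.out.println(v[0] + " " + v[1]); // prints "6.0 1.0"
    }
}
```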
2) semantic-relevance-rated projection: suppose you have a way to
obtain a numerical value associated with each context. This number is
an index of "semantic relevance" for each context (how it is obtained
is another concern; let's ignore it for now).
The rated projection is then a weighted sum, where the weights are
given by an m-dimensional semantic-relevance vector associated with the
context data.
         m
        ---
   V  = \    w  * M        for each i in [1,n]
    i   /     j    i,j
        ---
       j = 1

where

   M    = relevance of the i-th term in the j-th context    [n*m]
    i,j

   V    = projected relevance of the i-th term              [n]
    i

   w    = relevance weight of the j-th context              [m]
    j
The above projection is the well-known matrix product and could be
written as

   V = M * w

   +---+     +---------+     +---+
   |   |     |         |     |   |
   | V |  =  |    M    |  *  | w |
   |   |     |         |     |   |
   +---+     +---------+     +---+
  [n x 1]     [n x m]       [m x 1]
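A sketch of the rated projection as code (the relevance matrix and
weight vector below are invented toy values):

```java
public class RatedProjection {
    // V = M * w: weighted sum over contexts, with weights taken from the
    // semantic-relevance vector w.
    public static double[] project(double[][] m, double[] w) {
        double[] v = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < w.length; j++)
                v[i] += w[j] * m[i][j];
        return v;
    }

    public static void main(String[] args) {
        double[][] m = {{1.0, 2.0}, {3.0, 0.0}}; // 2 terms x 2 contexts
        double[] w = {1.0, 0.5};                 // context 2 counts half as much
        double[] v = project(m, w);
        System.out.println(v[0] + " " + v[1]);   // prints "2.0 3.0"
    }
}
```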
[NOTE: in general terms, the above projection could be augmented with a
weight matrix of [m x n] dimension, but this would mean having a way to
indicate the relevance of each term in each specific context, which is
clearly overwhelming: it would require collection-specific tuning
instead of markup-schema-specific tuning, which is much less expensive
to edit.]
- o -
As you can see, once the projection is performed there is no difference
between the previous text-based vector space and this new one. In
general terms, one could think of this projection system as a more
complex way of computing the 'term relevance' that is normally derived
from frequency alone.
So, architecturally, it could be added to Lucene by making the
vector-space generator pluggable (or at least, extensible).
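As a sketch of what such a pluggable generator might look like (the
interface name and shape are purely hypothetical, not an existing
Lucene API):

```java
// Hypothetical extension point: the projection that collapses the
// term x context relevance matrix into the flat term vector the rest
// of the engine already understands.
interface VectorSpaceGenerator {
    double[] project(double[][] termContextMatrix);
}

public class PluggableProjectionSketch {
    // The rated projection packaged as one interchangeable strategy.
    public static VectorSpaceGenerator rated(double[] weights) {
        return m -> {
            double[] v = new double[m.length];
            for (int i = 0; i < m.length; i++)
                for (int j = 0; j < weights.length; j++)
                    v[i] += weights[j] * m[i][j];
            return v;
        };
    }

    public static void main(String[] args) {
        // Swap projection strategies without touching the indexing pipeline.
        VectorSpaceGenerator g = rated(new double[]{1.0, 0.5});
        double[] v = g.project(new double[][]{{2.0, 2.0}});
        System.out.println(v[0]); // prints "3.0"
    }
}
```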
What do you think?
--
Stefano Mazzocchi       One must still have chaos in oneself to be
                         able to give birth to a dancing star.
                                            -- Friedrich Nietzsche