lucene-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Relevance boosting with the aid of semantic markup
Date Thu, 06 Dec 2001 12:43:19 GMT
Hello everybody,

first of all, let me state that I've looked into Lucene internals (and
read all Doug's papers) and I'm impressed by the elegance of the
architecture design, the resulting flexibility of the engine and the
impressive performance and memory use.

Outstanding.

Now I would like to know your opinion on something.

Suppose we have some content like this:

Document1:
 <paragraph>This is a paragraph about
<strong>Lucene</strong></paragraph>

Document2:
 <paragraph>This is a paragraph about Lucene</paragraph>

Now we search for "lucene".

The optimal result would be to have Document1 rated higher than
Document2, since <strong> identifies the term as more important.

I don't think this is currently possible with Lucene's algorithms (since
they are based on one-dimensional text, while markup adds at least one
more dimension), but I'd love to be wrong since I'm lazy :)

Anyway, a possible solution would be to add the ability to attach a
'boost-factor' to each token so that the Scorer can rate hits based on
this information (the search phase itself would not be influenced by
these boost factors).
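
To make the idea concrete, here is a minimal sketch in plain Java. It
does not use any Lucene API: BoostedTokens, BoostedToken, STRONG_BOOST,
tokenize() and score() are all hypothetical names invented for this
illustration, and the boost value of 2.0 for <strong> is an arbitrary
assumption. Matching is unchanged (a token either equals the query term
or it does not); only the rating of a hit depends on the boost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names exist in Lucene.
final class BoostedTokens {

    // Assumed boost for text inside <strong>; the value is arbitrary.
    static final float STRONG_BOOST = 2.0f;

    // A token together with the boost implied by its enclosing markup.
    static final class BoostedToken {
        final String text;
        final float boost;

        BoostedToken(String text, float boost) {
            this.text = text;
            this.boost = boost;
        }
    }

    // Matches either a tag (groups 1 and 2) or a word (group 3).
    private static final Pattern TAG_OR_WORD =
            Pattern.compile("<(/?)(\\w+)[^>]*>|(\\w+)");

    // Splits markup into lower-cased tokens, boosting those inside <strong>.
    static List<BoostedToken> tokenize(String markup) {
        List<BoostedToken> tokens = new ArrayList<>();
        Matcher m = TAG_OR_WORD.matcher(markup);
        int strongDepth = 0;
        while (m.find()) {
            if (m.group(2) != null) {
                // A tag: track how deeply we are nested inside <strong>.
                if (m.group(2).equalsIgnoreCase("strong")) {
                    strongDepth += m.group(1).isEmpty() ? 1 : -1;
                }
            } else {
                // A word: record it with the boost of its current context.
                float boost = strongDepth > 0 ? STRONG_BOOST : 1.0f;
                tokens.add(new BoostedToken(
                        m.group(3).toLowerCase(Locale.ROOT), boost));
            }
        }
        return tokens;
    }

    // Rates a document for a term by summing the boosts of matching tokens.
    static float score(List<BoostedToken> doc, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        float total = 0f;
        for (BoostedToken token : doc) {
            if (token.text.equals(wanted)) {
                total += token.boost;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String doc1 = "<paragraph>This is a paragraph about\n"
                + "<strong>Lucene</strong></paragraph>";
        String doc2 = "<paragraph>This is a paragraph about Lucene</paragraph>";
        System.out.println("Document1: " + score(tokenize(doc1), "lucene"));
        System.out.println("Document2: " + score(tokenize(doc2), "lucene"));
    }
}
```

With these assumptions both documents match "lucene", but Document1
rates 2.0 against 1.0 for Document2. A real scorer would presumably fold
the per-token boost into its existing term weighting rather than use a
plain sum.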

If this is possible, it would be much easier to perform XML indexing
with Lucene without losing the semantic contextual information that
markup can convey.

Comments?

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<stefano@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------


