From: Stefano Mazzocchi
To: Apache Lucene
Date: Thu, 06 Dec 2001 13:43:19 +0100
Subject: Relevance boosting with the aid of semantic markup
Message-ID: <3C0F67E7.B01CA125@apache.org>
List-Id: "Lucene Developers List"

Hello everybody,

first of all, let me state that I've looked into Lucene internals (and read all Doug's papers) and I'm impressed by the elegance of the architecture design, the resulting flexibility of the engine and the impressive performance and memory use. Outstanding.

Now I would like to know your opinion on something. Suppose we have some content like this:

Document1: This is a paragraph about Lucene
Document2: This is a paragraph about Lucene

Now we search for "lucene". The optimal result would be to have Document1 rated higher than Document2, since its markup identifies a more important result.

I don't think this is currently possible with Lucene algorithms (since they are based on one-dimensional text, while markup adds at least another dimension), but I'd love to be wrong since I'm lazy :)

Anyway, a possible solution would be to add the ability to attach a 'boost-factor' to each token so that the Scorer can rate hits based on this information (the search phase itself would not be influenced by these boost factors).

If this is possible, it would be much easier to perform XML indexing with Lucene without losing the semantic contextual information that markup can convey.

Comments?

--
Stefano Mazzocchi
One must still have chaos in oneself to be able to give birth to a dancing star.
Friedrich Nietzsche
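
To make the per-token boost idea above concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class BoostedTokenSketch, its BoostedToken type, and the tokenize and score methods are all hypothetical names invented for illustration, and the boost values 2.0 and 1.0 are assumed stand-ins for whatever weight an indexer might derive from the enclosing markup. Matching ignores the boost entirely; the boost only weights the score of tokens that already match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BoostedTokenSketch {

    /** Hypothetical token type: the term text plus a boost taken from its enclosing markup. */
    public static final class BoostedToken {
        public final String text;
        public final float boost;

        public BoostedToken(String text, float boost) {
            this.text = text;
            this.boost = boost;
        }
    }

    /** Splits on whitespace, lowercases, and gives every token the boost of its enclosing element. */
    public static List<BoostedToken> tokenize(String content, float elementBoost) {
        List<BoostedToken> tokens = new ArrayList<>();
        for (String word : content.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(new BoostedToken(word.toLowerCase(Locale.ROOT), elementBoost));
            }
        }
        return tokens;
    }

    /** Sums the boosts of the tokens equal to the term; the boost never decides whether a token matches. */
    public static float score(List<BoostedToken> document, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        float sum = 0f;
        for (BoostedToken token : document) {
            if (token.text.equals(wanted)) {
                sum += token.boost;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        String text = "This is a paragraph about Lucene";
        // Assumed boosts: Document1's markup is treated as more important than Document2's.
        List<BoostedToken> document1 = tokenize(text, 2.0f);
        List<BoostedToken> document2 = tokenize(text, 1.0f);
        System.out.println("Document1 score: " + score(document1, "lucene"));
        System.out.println("Document2 score: " + score(document2, "lucene"));
    }
}
```

Both documents match the query, but Document1 scores higher because its tokens carry the larger boost. A real scorer would presumably combine such a boost with term frequency and other factors rather than summing boosts alone.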