Date: Sat, 3 Aug 2002 23:46:50 +0200
Subject: Re: text format and scoring
From: petite_abeille
To: "Lucene Users List"

Hi Alex,

On Saturday, August 3, 2002, at 11:13 , Alex Murzaku wrote:

> Hi PA! How are things going?

Doing all right :-)

> It's an interesting question but I don't think Lucene
> (as it is today) could change weights based on
> semantics (either assigned by formatting tags or maybe
> looked up in some dictionary like WordNet)...

Ummm... I see.

> Some time ago, Doug sent to this list the formula for
> the score computation which is:

Thanks.

> The only thing that counts is the frequency of the
> terms in the document and among documents.
>
> A way to influence the final score might be to tweak
> the real frequencies during indexing with some
> parameters configured externally. Let's say if the
> word is underlined then multiply its count by X. This
> modified TF should influence the final score
> accordingly.
>
> Just a thought...

I see. That's basically what I'm doing right now: I index a document multiple times (e.g. an email could be indexed by subject, first sentence, and body content). Then I do multiple searches and use a "ranking comparator" to evaluate the results based on how many times I get a specific document, plus its Lucene scores and other funky heuristics. Which seems to work OK, but is kind of cumbersome :-(

Same deal for finding "related" documents. Lucene is very good at finding "similar" documents, but for "related" ones (think "cluster" ;-), I basically end up doing some term categorization and assigning a multiplying factor to each term category, which I then feed to Lucene to get something more akin to a "cluster" of documents...

In any case, I was simply wondering if there was a more straightforward way of doing things.

Cheers,

PA.
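For anyone following along, the "ranking comparator" idea above (one search per indexed part, then a merge by hit count and score) could look roughly like the sketch below. It does not call Lucene at all: the function name `merge_rankings`, the field names, the per-field weights, and the sample scores are all made up for illustration, and the real heuristics are certainly funkier than this.

```python
from collections import defaultdict


def merge_rankings(results_by_field, field_weights=None):
    """Combine per-field hit lists into a single ranking.

    results_by_field maps a field name (e.g. "subject") to a list of
    (doc_id, score) pairs, as one search per field might return.
    field_weights is an optional, hypothetical multiplier per field.
    """
    field_weights = field_weights or {}
    hit_count = defaultdict(int)      # how many searches returned the doc
    total_score = defaultdict(float)  # sum of weighted scores per doc

    for field, hits in results_by_field.items():
        weight = field_weights.get(field, 1.0)
        for doc_id, score in hits:
            hit_count[doc_id] += 1
            total_score[doc_id] += weight * score

    # Rank first by the number of searches that matched the document,
    # then by its accumulated weighted score.
    return sorted(
        hit_count,
        key=lambda doc_id: (hit_count[doc_id], total_score[doc_id]),
        reverse=True,
    )


if __name__ == "__main__":
    # Invented sample data: three searches over the same set of emails.
    results = {
        "subject": [("msg-1", 0.9), ("msg-2", 0.4)],
        "first_sentence": [("msg-1", 0.5)],
        "body": [("msg-2", 0.8), ("msg-3", 0.7)],
    }
    print(merge_rankings(results, {"subject": 2.0}))
```

Sorting on the pair (hit count, weighted score) is only one possible choice; a single blended score would work too, at the cost of having to tune how much a repeated hit is worth.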