Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Date: Sat, 3 Aug 2002 23:46:50 +0200
Subject: Re: text format and scoring
Content-Type: text/plain; charset=US-ASCII; format=flowed
Mime-Version: 1.0 (Apple Message framework v482)
From: petite_abeille <petite_abeille@mac.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Content-Transfer-Encoding: 7bit
In-Reply-To: <20020803211326.79060.qmail@web11902.mail.yahoo.com>
Message-Id: <844B6214-A72A-11D6-929A-000393760B7E@mac.com>

Hi Alex,

On Saturday, August 3, 2002, at 11:13 , Alex Murzaku wrote:

> Hi PA! How are things going?

Doing all right :-)

>
> It's an interesting question but I don't think Lucene
> (as it is today) could change weights based on
> semantics (either assigned by formatting tags or maybe
> looked up in some dictionary like WordNet)...

Ummm... I see.

>
> Some time ago, Doug sent to this list the formula for
> the score computation which is:

Thanks.


> The only thing that counts is the frequency of the
> terms in the document and among documents.
>
> A way to influence the final score might be to tweak
> the real frequencies during indexing with some
> parameters configured externally. Let's say if the
> word is underlined then multiply its count by X. This
> modified TF should influence the final score
> accordingly.
>
> Just a thought...

I see. That's what I'm basically doing right now somehow: I index a 
document multiple time (eg an email could be indexed by subject, first 
sentence and body content). Then I do multiple searches. And use a 
"ranking comparator" to evaluate the result based on how many time I get 
a specific document plus its Lucene scores and other funky heuristics. 
Which seems to work ok, but is kind of cumbersome :-( Same deal for 
finding "related" document. Lucene is very good for finding "similar" 
document, but for "related" (think "cluster" ;-), I basically end up 
doing some term categorization and assign some multiplying factor for 
each term category. Which then I feed to Lucene to get something more 
akin to a "cluster" of document...

In any case, I was simply wandering if there was a more straightforward 
way of doing things.

Cheers,

PA.


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>