lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Proposal for change to DefaultSimilarity's lengthNorm to fix "short document" problem
Date Thu, 07 Jul 2005 22:52:06 GMT

On Jul 7, 2005, at 3:16 PM, Mark Bennett wrote:
> Scanning their paper very quickly, I didn't see a specific mention
> (though I might have missed it) of extremely short documents (< 5 words).

The study does not concern itself with different document lengths.   
They chose 6 different collections, but it appears that they were  
looking for a diversity of authorship and subject matter.

> Was there
> something specific about 1 and 2 word documents you had in mind?

Could you use a negative document boost on 1 and 2 word docs to solve  
your particular problem?
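
To make the boost idea concrete, here is a minimal sketch. Lucene boosts are multipliers, so a "negative" boost in practice means a factor below 1.0. The class name, the 2-word cutoff constant, and the 0.5 factor are all assumptions chosen for illustration; in Lucene of this era the result would presumably be passed to Document.setBoost(float) at index time.

```java
// Illustrative helper, not part of Lucene: picks an index-time boost
// factor from a document's word count.
final class ShortDocBoost {
    // Hypothetical cutoff: documents with this many words or fewer are demoted.
    static final int SHORT_DOC_MAX_WORDS = 2;

    // Hypothetical demotion factor; any value below 1.0 lowers the score.
    static final float SHORT_DOC_BOOST = 0.5f;

    // Returns the boost to apply to a document with the given word count.
    static float boostFor(int wordCount) {
        return wordCount <= SHORT_DOC_MAX_WORDS ? SHORT_DOC_BOOST : 1.0f;
    }
}
```

The factor would need tuning against real queries, since it has to offset the large length norm that very short documents receive.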

After pondering the clip method a little more, I've become wary of  
its effect on title fields.  It would work very well on what you  
refer to as "main" and I generally call "bodytext", but if it were  
set as a default, it would become necessary to weight "title" fields  
or short "keywords" fields more heavily.

I think it would be possible, even desirable, to turn on clipping for  
bodytext while turning it off for title/keywords.  That would require  
the implementor to be familiar with scoring formula theory, though.
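
A rough sketch of per-field clipping follows. It assumes that "clipping" means treating any field shorter than some minimum token count as if it had that minimum length, and it reuses the familiar 1/sqrt(numTokens) shape of the length norm. It is a standalone class, not a Lucene Similarity subclass; the threshold of 20 tokens and the set of clipped field names are hypothetical.

```java
import java.util.Set;

// Illustrative only: mirrors the 1/sqrt(numTokens) length norm,
// with a floor on the token count for selected fields.
final class ClippedLengthNorm {
    // Hypothetical threshold: clipped fields shorter than this are
    // scored as if they contained this many tokens.
    static final int MIN_TOKENS = 20;

    // Hypothetical set of fields for which clipping is turned on.
    static final Set<String> CLIPPED_FIELDS = Set.of("bodytext");

    static float lengthNorm(String fieldName, int numTokens) {
        // Guard against division by zero for empty fields.
        int effective = Math.max(numTokens, 1);
        if (CLIPPED_FIELDS.contains(fieldName)) {
            // Clip: very short body text gets no extra credit for brevity.
            effective = Math.max(effective, MIN_TOKENS);
        }
        return (float) (1.0 / Math.sqrt(effective));
    }
}
```

With these assumed settings, a 1-word "bodytext" field gets the same norm as a 20-word one, while a 1-word "title" field still receives the full norm of 1.0, so short titles keep their advantage.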

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

